
sequence of file formats in bioinformatics


DESCRIPTION

methods and tools


SEQUENCE FILE FORMATS


Introduction

Data is stored in biological databases as sequences or in molecular form
Each database uses its own file format to represent its data
Categories of file formats:
  Sequence database formats
  Molecular database formats


Sequence file formats

GenBank flat-file format
FASTA format
Multi-FASTA format
GCG format
GCG-MSF format
EMBL format
Clustal format
Swiss-Prot format


GenBank flat-file format

Used by NCBI
Divided into three parts:
  Header: a brief, precise introductory part
  Features: all genes in the sequence, location of genes in the genome, protein products, coding sequences, etc.
  Sequence: starts with the keyword ORIGIN and ends with //, e.g.
    ORIGIN
    atcgatcgatgcgctat
    //
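The three-part layout can be illustrated with a minimal Python sketch. It assumes that the features part begins at a line starting with FEATURES and that the sequence part begins at ORIGIN and ends at //. The function name `split_genbank_record` and the demo record are hypothetical and were made up for illustration.

```python
def split_genbank_record(text):
    """Split one GenBank flat-file record into header, features and sequence."""
    header, features, sequence = [], [], []
    section = header
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of the record
        if line.startswith("FEATURES"):
            section = features
        elif line.startswith("ORIGIN"):
            section = sequence
            continue  # the ORIGIN keyword line itself carries no residues
        section.append(line)
    # Sequence lines look like "        1 atcgatcgat gcgctat":
    # keep the letters, drop position numbers and spaces.
    residues = "".join(ch for ch in "".join(sequence) if ch.isalpha())
    return "\n".join(header), "\n".join(features), residues


# Hypothetical demo record, not a real database entry.
record = """LOCUS       DEMO            17 bp    DNA     linear
DEFINITION  Hypothetical demo record.
FEATURES             Location/Qualifiers
     source          1..17
ORIGIN
        1 atcgatcgat gcgctat
//
"""
header, features, seq = split_genbank_record(record)
print(seq)
```

A real record may hold several entries separated by //; this sketch reads only the first one.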


Description of GenBank flat-file identifiers

HEADER
  LOCUS
  DEFINITION
  ACCESSION
  VERSION
  DBSOURCE: source database record, with its dates of creation and modification
  KEYWORDS
  SOURCE
  ORGANISM
  REFERENCE, with AUTHORS, TITLE, JOURNAL and MEDLINE ID: all published sources
  COMMENT
FEATURES
SEQUENCE


Retrieved from NCBI


FASTA format

One-line header: starts with > followed by the name of the gene
Sequence of the gene or protein
Blank spaces, paragraph marks and numerals are all ignored
An asterisk (*) may mark the end of the sequence
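These reading rules can be sketched in a few lines of Python. The function name `read_fasta_record` and the shortened example are hypothetical. As a simplification, the sketch keeps only letters, so it drops every asterisk and gap character, not just a trailing one.

```python
def read_fasta_record(text):
    """Return (name, sequence) for a single FASTA record."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    name = lines[0][1:].strip()
    body = "".join(lines[1:])
    # Keep letters only: spaces, line breaks, numerals and '*' are dropped.
    sequence = "".join(ch for ch in body if ch.isalpha())
    return name, sequence


# Shortened, hypothetical example with a numeral and a closing asterisk.
example = """>p53
ctcgaggggc ctagacattg
61 gccagggtgt ccccttccta*
"""
name, sequence = read_fasta_record(example)
print(name, len(sequence))
```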


FASTA Format

>p53
ctcgaggggc ctagacattg ccctccagag agagcaccca acaccctcca ggcttgaccg
61 gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
121 tgggacacca gctggccttc aaggtctctg cctccctcca gccaccccac tacacgctgc
181 tgggatcctg gatctcagct ccctggccga caacactggc aaactcctac tcatccacga
241 aggccctcct gggcatggtg gtccttccca gcctggcagt ctgttcctca cacaccttgt
301 tagtgcccag cccctgaggt tgcagctggg ggtgtctctg aagggctgtg agcccccagg
361 aagccctggg gaagtgcctg ccttgcctcc ccccggccct gccagcgcct ggctctgccc*


Multi-FASTA format

An aggregation of FASTA records of the kind described above
Multiple sequences follow one after the other in a single file
Accepted by several tools, e.g.:
  Clustal W
  Multalin


Multi-FASTA format

>jhuma
gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
>bhuma
gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
>puma
gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
>zuma
gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
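Reading a multi-FASTA file amounts to starting a new record at every > line. A minimal sketch, assuming that record names are unique within the file; the function name `read_multi_fasta` is hypothetical and the sequences are shortened.

```python
def read_multi_fasta(text):
    """Map each record name to its sequence, in file order."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Remove the spaces that separate blocks of ten residues.
            records[name].append("".join(line.split()))
    return {key: "".join(parts) for key, parts in records.items()}


multi = """>jhuma
gccagggtgt ccccttccta
>bhuma
gccagggtgt ccccttccta
"""
sequences = read_multi_fasta(multi)
print(list(sequences))
```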


GCG format

GCG: Genetics Computer Group
The first line states the sequence type:
  !!NA_SEQUENCE 1.0 for nucleic acid sequences
  !!AA_SEQUENCE 1.0 for amino acid (protein) sequences
A simple format that gives only the sequence of the gene or protein
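Because the first line carries the sequence type, a reader can decide how to treat a GCG file before parsing the rest. A small sketch, assuming the type tags are spelled !!NA_SEQUENCE and !!AA_SEQUENCE; the function name `gcg_sequence_type` is hypothetical.

```python
def gcg_sequence_type(first_line):
    """Return 'nucleic acid', 'protein' or None from the first line of a GCG file."""
    tag = first_line.strip().upper()
    if tag.startswith("!!NA_SEQUENCE"):
        return "nucleic acid"
    if tag.startswith("!!AA_SEQUENCE"):
        return "protein"
    return None  # not a recognised GCG type line


print(gcg_sequence_type("!!NA_SEQUENCE 1.0"))
print(gcg_sequence_type("!!AA_SEQUENCE 1.0"))
```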


GCG format


GCG-MSF format

Holds multiple sequences: sequence names, sequences, alignment
The word PileUp indicates that the file contains multiple sequences
The mandatory word MSF tells that the file is a GCG-MSF file and not just a GCG file
Comments are terminated with //
2 consecutive blank lines
Multiple sequences


GCG MSF Format


EMBL format

Sequence format of the European Molecular Biology Laboratory (EMBL) database
Starts with an ID (identification) line
Ends with // as terminator
Each line type has its own format
Used to record various forms of data, e.g. DNA, RNA, gene, protein, etc.
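Since every line type has its own format, a reader usually begins by grouping lines by their leading two-letter code. A minimal sketch, assuming the code occupies the first two columns and the content starts at column six; the function name `group_embl_lines` and the demo entry are hypothetical.

```python
from collections import defaultdict


def group_embl_lines(text):
    """Group the lines of one EMBL-style entry by their two-letter line code."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            break  # terminator: end of the entry
        code, content = line[:2], line[5:].rstrip()
        # Spacer lines (XX) have no content; sequence lines have a blank code.
        if code.strip() and content:
            groups[code].append(content)
    return dict(groups)


# Hypothetical demo entry, not a real database record.
entry = """ID   DEMO1; SV 1; linear; genomic DNA; STD; UNC; 17 BP.
XX
AC   DEMO1;
XX
DE   Hypothetical demo entry.
XX
SQ   Sequence 17 BP;
     atcgatcgat gcgctat                                                17
//
"""
groups = group_embl_lines(entry)
print(sorted(groups))
```

The sequence lines themselves are skipped here because they carry no line code; a fuller reader would collect them after the SQ line.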


EMBL format


Clustal format

Clustal is among the most widely used sequence alignment tools
  CLUSTAL W
  CLUSTAL X
Holds aligned protein or gene sequences


Clustal X


Swiss-Prot format

Protein sequence database
ID: identification
AC: accession number
DE: description
GN: gene name
OS: organism species
OG: organelle
OC: organism classification
OX: organism taxonomy cross-reference
RN: reference number
RP: reference position


Continued…

RC: reference comment
RX: reference cross-reference
RA: reference authors
RT: reference title
RL: reference location
CC: comments
DR: database cross-reference
KW: keywords
FT: feature table
SQ: sequence
//: end of entry


Sequence conversion tools

Several software tools have been designed by … ?
The aim of these tools is to convert one sequence format into another
Some of the tools widely used for sequence interconversion are:
  ReadSeq
  GCG
  SeqVerter
  Seqret
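To show what such a conversion involves, here is a hand-written sketch that turns a GenBank flat-file record into FASTA. It does not reproduce how any of the tools listed above work. The function name `genbank_to_fasta`, the `width` parameter and the demo record are hypothetical, and the sketch assumes the record name is the second field of the LOCUS line.

```python
def genbank_to_fasta(text, width=60):
    """Convert one GenBank flat-file record to a FASTA record."""
    name = "unknown"
    residues = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]  # assumed: record name follows the keyword
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Drop position numbers and spaces, keep the residues.
            residues.append("".join(ch for ch in line if ch.isalpha()))
    sequence = "".join(residues)
    # Wrap the sequence into lines of at most `width` residues.
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(wrapped) + "\n"


# Hypothetical demo record, not a real database entry.
demo = """LOCUS       DEMO            17 bp    DNA     linear
ORIGIN
        1 atcgatcgat gcgctat
//
"""
print(genbank_to_fasta(demo), end="")
```

Only the name and the sequence survive the conversion; the feature annotations have no place in FASTA, which is one reason dedicated tools are preferred for real data.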


ReadSeq

Developed by Dr. D. G. Gilbert
Automated conversion
18 supported file formats, which can be interconverted


Assignment

FASTA
Multi-FASTA
GenBank flat file
GCG format
EMBL
Clustal
Swiss-Prot

Make each file by this Friday and send them as attachments in an email


Molecular file formats

continued…