Sequence file formats

BIOINFORMATICS SEQUENCE FILE FORMATS

Presented By: Alphy Joseph

Date: 03 March 2016

Important file formats

• GenBank

• FASTA

• PIR

• ALN/ClustalW2

• GCG/MSF

Early Data Formats

• Early sequence databases stored sequence data in a file. The file held the sequence in ASCII (plain) text and had a descriptive filename.

• This method became limiting when researchers wanted to include annotations and information about the source of the sequence.

• Difficulty in searching for sequences was also an issue.

Flat File Storage Data Formats

• By the time GenBank, EMBL and DDBJ formed a collaboration (1986), sequence databases had moved to a defined flat file format with a shared feature table format and annotation standards.

• The PIR also adopted a similar format for protein sequences.

• The flat file formats from the sequence databases are still used to access and display sequence and annotation. They are also convenient for storage of local copies.

FASTA Format

• Bioinformaticists have developed a standard format for nucleotide and protein sequences that allows them to be read by a wide range of programs. This format is called FASTA format.

• In FASTA format, each nucleotide or amino acid is represented using a single letter.

• The first line of a FASTA record is the comment line, identified by a leading greater-than symbol ">". This line identifies the sequence and typically includes the accession number from NCBI, GenBank or another repository.

• The remaining lines contain the sequence, typically wrapped at 80 or 120 characters per line.
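The FASTA rules above can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: the function names `parse_fasta` and `wrap_sequence` and the demo records are made up for this example, and the code assumes every record starts with a ">" line.

```python
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (comment_line, sequence) pairs from FASTA-formatted text."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        yield header, "".join(chunks)


def wrap_sequence(seq: str, width: int = 80) -> str:
    """Split a sequence into lines of at most `width` characters."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


# Hypothetical demo data, not taken from any real database entry.
example = "\n".join([
    ">demo_1 hypothetical nucleotide record",
    "ACGTACGTAC",
    "GTACGT",
    ">demo_2 hypothetical protein record",
    "MKVLAAGIV",
])
records = list(parse_fasta(example))
```

Joining the sequence lines on read and re-wrapping them on write keeps the line width a presentation detail, so a reader does not need to care whether a file uses 80 or 120 characters per line.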

PIR FORMAT

• A sequence in PIR format consists of:

– One line starting with
  • a ">" (greater-than) sign, followed by
  • a two-letter code describing the sequence type (P1, F1, DL, DC, RL, RC, or XX), followed by
  • a semicolon, followed by
  • the sequence identification code (the database ID-code).

– One line containing a textual description of the sequence.

– One or more lines containing the sequence itself. The end of the sequence is marked by a "*" (asterisk) character.

– Optionally, this can be followed by one or more lines describing the sequence. Software that is supposed to read only the sequence should ignore these.

• A file in PIR format may comprise more than one sequence.

• The PIR format is also often referred to as the NBRF format.
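As a rough sketch of how the PIR layout described above could be read, the Python below pulls out the type code, ID, description and sequence of each record. The function name `parse_pir`, the dictionary keys and the demo entries are assumptions made for illustration; the code expects well-formed input and only knows the seven type codes listed above.

```python
import re
from typing import Dict, List

# ">" + two-letter type code + ";" + identification code
PIR_HEADER = re.compile(r"^>(P1|F1|DL|DC|RL|RC|XX);(\S+)")


def parse_pir(text: str) -> List[Dict[str, str]]:
    """Return one dict per PIR record found in `text`."""
    records: List[Dict[str, str]] = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        match = PIR_HEADER.match(lines[i])
        if match is None:
            i += 1  # blank line or optional trailing description: ignored
            continue
        seq_type, seq_id = match.groups()
        description = lines[i + 1].strip() if i + 1 < len(lines) else ""
        i += 2
        chunks: List[str] = []
        # Collect sequence lines up to the "*" terminator.
        while i < len(lines) and not lines[i].startswith(">"):
            line = lines[i].strip()
            i += 1
            if "*" in line:
                chunks.append(line.split("*", 1)[0])
                break
            chunks.append(line)
        sequence = "".join("".join(chunks).split())  # drop any spaces
        records.append({"type": seq_type, "id": seq_id,
                        "description": description, "sequence": sequence})
    return records


# Hypothetical demo data with two records in one file.
example = "\n".join([
    ">P1;DEMO1",
    "Hypothetical demo protein",
    "MKVLAAGIVG LLAAS",
    "TTW*",
    "Free-text note that a sequence-only reader ignores.",
    ">DL;DEMO2",
    "Hypothetical demo DNA",
    "ACGTACGT*",
])
records = parse_pir(example)
```

The free-text line after the asterisk is skipped because it does not match the header pattern, which mirrors the rule that sequence-only software should ignore trailing description lines.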

ALN/ClustalW

• The first line in the file must start with the words "CLUSTAL W" or "CLUSTALW". Other information in the first line is ignored.

• One or more empty lines.

• One or more blocks of sequence data. Each block consists of:

– One line for each sequence in the alignment. Each line consists of:
  • the sequence name
  • white space
  • up to 60 sequence symbols
  • optionally, white space followed by a cumulative count of residues for the sequence

– A line showing the degree of conservation for the columns of the alignment in this block.

– One or more empty lines.

• Some rules about representing sequences:
  • Case doesn't matter.
  • Sequence symbols should be from a valid alphabet.
  • Gaps are represented using hyphens ("-").

• The characters used to represent the degree of conservation are:

* - all residues or nucleotides in that column are identical

: - conserved substitutions have been observed

. - semi-conserved substitutions have been observed

(space) - no match.
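The block structure of an ALN file can be illustrated with a short Python sketch. The names `parse_clustal` and `identical_columns` and the demo alignment are hypothetical. The sketch assumes the conservation line starts with white space (as it does when it is aligned under the sequence columns), and it only recomputes the "*" marks, since ":" and "." depend on substitution groups that are not covered here.

```python
from typing import Dict


def parse_clustal(text: str) -> Dict[str, str]:
    """Return {sequence_name: aligned_sequence} from ALN/ClustalW text."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("not a ClustalW alignment: missing header line")
    alignment: Dict[str, str] = {}
    for line in lines[1:]:
        # Skip empty lines and conservation lines (which begin with spaces).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[2], when present, is the optional cumulative residue count.
        name, symbols = fields[0], fields[1]
        alignment[name] = alignment.get(name, "") + symbols
    return alignment


def identical_columns(alignment: Dict[str, str]) -> str:
    """Mark fully identical, non-gap columns with '*', others with a space."""
    seqs = [s.upper() for s in alignment.values()]  # case doesn't matter
    return "".join(
        "*" if len(set(col)) == 1 and col[0] != "-" else " "
        for col in zip(*seqs)
    )


# Hypothetical two-block demo alignment.
example = "\n".join([
    "CLUSTALW demo alignment (hypothetical data)",
    "",
    "seq_a      ACGT-ACGTA 9",
    "seq_b      ACGTTACG-A 9",
    "           **** *** *",
    "",
    "seq_a      GGC 12",
    "seq_b      GGA 12",
    "           **",
])
alignment = parse_clustal(example)
marks = identical_columns(alignment)
```

Each block contributes one chunk per sequence, so concatenating the chunks by name rebuilds the full aligned rows.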

GCG/MSF

• msf formatted multiple sequence files are most often created when using programs of the GCG suite.

• msf files include the sequence name and the sequence itself, which is usually aligned with other sequences in the file.

• You can specify a single sequence or many sequences within an msf file.

• Some of the hallmarks of an msf formatted file are the same as those of a single sequence GCG format file:

• Begins with the line (all uppercase) !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences.

• Do not edit or delete the file type line if it is present.

• A description line which contains informative text describing what is in the file. You can add this information to the top of the MSF file using a text editor.

• A dividing line which contains the number of bases or residues in the sequence, when the file was created, and importantly, two dots (..) which act as a divider between the descriptive information and the following sequence information.

• msf files contain some other information as well:

• Name/Weight: The name of each sequence included in the alignment, as well as its length and checksum (both non-editable) and weight (editable).

• Separating Line. Must include two slashes (//) to divide the name/weight information from the sequence alignment.

• Multiple Sequence Alignment. Each sequence named in the above Name/Weight lines is included. The alignment allows you to view the relationship among sequences.
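The parts of an msf file listed above (file type line, Name/Weight lines, the "//" separator, then the alignment) can be walked through with a small Python sketch. The names `msf_kind` and `parse_msf` are invented for this example, and every value in the demo file, including the checksums, is a placeholder rather than real GCG output.

```python
from typing import Dict, List


def msf_kind(text: str) -> str:
    """Classify the file from its optional file type line."""
    stripped = text.strip()
    first = stripped.splitlines()[0] if stripped else ""
    if first.startswith("!!NA_MULTIPLE_ALIGNMENT"):
        return "nucleic acid"
    if first.startswith("!!AA_MULTIPLE_ALIGNMENT"):
        return "amino acid"
    return "unknown"


def parse_msf(text: str) -> Dict[str, str]:
    """Return {sequence_name: aligned_sequence} from msf-formatted text."""
    lines = text.splitlines()
    names: List[str] = []
    body_start = None
    for i, line in enumerate(lines):
        fields = line.split()
        if len(fields) > 1 and fields[0] == "Name:":
            names.append(fields[1])  # from a Name/Weight line
        elif line.strip() == "//":
            body_start = i + 1  # the alignment follows the separator
            break
    if body_start is None:
        raise ValueError("missing '//' separating line")
    alignment = {name: "" for name in names}
    for line in lines[body_start:]:
        fields = line.split()
        # Only lines that start with a declared name carry sequence data;
        # blank lines and coordinate rulers are skipped.
        if fields and fields[0] in alignment:
            alignment[fields[0]] += "".join(fields[1:])
    return alignment


# Hypothetical demo file; "." marks a gap in this example.
example = "\n".join([
    "!!AA_MULTIPLE_ALIGNMENT 1.0",
    "Hypothetical demo alignment",
    " demo.msf  MSF: 15  Type: P  March 3, 2016 12:00  Check: 0  ..",
    "",
    " Name: seq_a  Len: 15  Check: 0  Weight: 1.00",
    " Name: seq_b  Len: 15  Check: 0  Weight: 1.00",
    "",
    "//",
    "",
    "seq_a  MKVLAAGIVG LLAAS",
    "seq_b  MKV.AAGIVG LLSAS",
])
kind = msf_kind(example)
alignment = parse_msf(example)
```

The sketch reads the file without touching lengths or checksums, in line with the note above that those fields are not meant to be edited by hand.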

THANK YOU
