BIOINFORMATICS SEQUENCE FILE FORMATS
Presented By: Alphy Joseph
Date: 03 March 2016

Sequence file formats



Page 2: Sequence file formats

Important file formats
• GenBank
• FASTA
• PIR
• ALN/ClustalW2
• GCG/MSF

Page 3: Sequence file formats

Early Data Formats

• The early sequence databases stored each sequence in a file. The file held the sequence in ASCII (plain) text and had a descriptive filename.

• This method became limiting when researchers wanted to include annotations and information about the source of the sequence.

• Searching for sequences was also difficult.

Page 4: Sequence file formats

Flat File Storage Data Formats

• By the time GenBank, EMBL and DDBJ formed a collaboration (1986), sequence databases had moved to a defined flat file format with a shared feature table format and annotation standards.

• The PIR also adopted a similar format for protein sequences.

Page 5: Sequence file formats

• The flat file formats from the sequence databases are still used to access and display sequence and annotation. They are also convenient for storage of local copies.

Page 10: Sequence file formats

FASTA Format

• Bioinformaticians have developed a standard format for nucleotide and protein sequences that allows them to be read by a wide range of programs. This format is called FASTA format.

• In FASTA format, each nucleotide or amino acid is represented by a single letter.

Page 11: Sequence file formats

• The first line of a FASTA record is the comment (description) line, identified by a leading greater-than symbol '>'. This line identifies the sequence and usually includes the accession number from NCBI, GenBank or another repository.

• The remaining lines contain the sequence, typically wrapped at 80 or 120 characters per line.
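
The FASTA layout described above (a '>' comment line followed by wrapped sequence lines) can be read with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `parse_fasta` and the two example records are hypothetical, and it assumes every record starts with a '>' line.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {comment line: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are wrapped; join them back together.
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Hypothetical records, for illustration only.
example = """>seq1 hypothetical example record
ATGGCGTACG
TTAGC
>seq2 another hypothetical record
MKVLA
"""
records = parse_fasta(example.splitlines())
for name, sequence in records.items():
    print(name, len(sequence))
```

For real work, an established library such as Biopython is usually a safer choice than a hand-written parser.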

Page 13: Sequence file formats

PIR FORMAT

• A sequence in PIR format consists of:

– One line starting with
  • a ">" (greater-than) sign, followed by
  • a two-letter code describing the sequence type (P1, F1, DL, DC, RL, RC, or XX), followed by
  • a semicolon, followed by
  • the sequence identification code (the database ID-code).

Page 14: Sequence file formats

– One line containing a textual description of the sequence.

– One or more lines containing the sequence itself. The end of the sequence is marked by a "*" (asterisk) character.

– Optionally, this can be followed by one or more lines describing the sequence. Software that is supposed to read only the sequence should ignore these.

Page 15: Sequence file formats

• A file in PIR format may comprise more than one sequence.

• The PIR format is also often referred to as the NBRF format.
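
The PIR rules above (a ">XX;ID" line, one description line, sequence lines ending in "*", then optional trailing lines) can be sketched as a small state machine in Python. The function name `parse_pir` and the two records below are hypothetical and serve only as illustration; the sketch assumes the description is exactly one line.

```python
def parse_pir(text):
    """Return a list of dicts with keys: type, id, description, sequence."""
    records = []
    current = None
    state = None  # one of "desc", "seq", "trailer"
    for raw in text.splitlines():
        line = raw.strip()
        # Header line: ">" + two-letter type code + ";" + identification code.
        if line.startswith(">") and len(line) > 3 and line[3] == ";":
            current = {"type": line[1:3], "id": line[4:].strip(),
                       "description": "", "sequence": ""}
            records.append(current)
            state = "desc"
        elif current is None:
            continue  # ignore anything before the first header
        elif state == "desc":
            current["description"] = line
            state = "seq"
        elif state == "seq":
            if "*" in line:
                # The asterisk marks the end of the sequence.
                line = line[:line.index("*")]
                state = "trailer"
            current["sequence"] += "".join(line.split())
        # In state "trailer" the optional descriptive lines are ignored.
    return records


# Hypothetical records, for illustration only.
pir_text = """>P1;EXAMPLE1
Hypothetical protein (illustrative record)
MDITIHNPLI RRPLFSWLAP
SRIFDQ*
C;This trailing comment line is ignored
>DL;EXAMPLE2
Hypothetical DNA fragment
ACGTACGT*
"""
pir_records = parse_pir(pir_text)
for rec in pir_records:
    print(rec["type"], rec["id"], rec["sequence"])
```

Because each header line starts a new record, the same loop handles files that comprise more than one sequence.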

Page 17: Sequence file formats

ALN/ClustalW

• The first line in the file must start with the words "CLUSTAL W" or "CLUSTALW". Other information in the first line is ignored.

• One or more empty lines.

• One or more blocks of sequence data. Each block consists of:

– One line for each sequence in the alignment. Each line consists of:
  • the sequence name
  • white space
  • up to 60 sequence symbols
  • optionally, white space followed by a cumulative count of residues for the sequence

Page 18: Sequence file formats

– A line showing the degree of conservation for the columns of the alignment in this block.

– One or more empty lines.

• Some rules about representing sequences:
  • Case doesn't matter.
  • Sequence symbols should be from a valid alphabet.
  • Gaps are represented using hyphens ("-").

Page 19: Sequence file formats

• The characters used to represent the degree of conservation are:

"*" : all residues or nucleotides in that column are identical

":" : conserved substitutions have been observed

"." : semi-conserved substitutions have been observed

" " (space) : no match
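
The block structure described above can be read back into full-length sequences by concatenating each sequence's symbols across blocks. The following is a minimal Python sketch under stated assumptions: the function name `parse_clustal` and the example alignment are hypothetical, the header check accepts any first line starting with "CLUSTAL", and conservation lines are assumed to begin with white space (they have no sequence name).

```python
def parse_clustal(text):
    """Return {sequence name: aligned sequence} from ALN/ClustalW text."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("not a CLUSTAL alignment: missing header line")
    sequences = {}
    for line in lines[1:]:
        # Empty lines separate blocks; lines that start with white space
        # carry the conservation symbols and are skipped here.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[2], when present, is the optional cumulative residue count.
        name, symbols = fields[0], fields[1]
        sequences[name] = sequences.get(name, "") + symbols
    return sequences


# Hypothetical two-block alignment, for illustration only.
aln_text = """CLUSTAL W (1.83) multiple sequence alignment

seqA    ACGT-ACGT 8
seqB    ACGTTACGA 9
        **** ***

seqA    GGC 11
seqB    GGC 12
        ***
"""
alignment = parse_clustal(aln_text)
for name, aligned in alignment.items():
    print(name, aligned)
```

Keeping the gap characters in the returned strings preserves the column correspondence between sequences.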

Page 21: Sequence file formats

GCG/MSF

• MSF-formatted multiple sequence files are most often created when using programs of the GCG suite.

• MSF files include the sequence name and the sequence itself, which is usually aligned with other sequences in the file.

• You can specify a single sequence or many sequences within an MSF file.

Page 22: Sequence file formats

• Some of the hallmarks of an MSF-formatted file are the same as those of a single-sequence GCG format file:

• It begins with the file type line (all uppercase): !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences.

• Do not edit or delete the file type line if it is present.

Page 23: Sequence file formats

• A description line which contains informative text describing what is in the file. You can add this information to the top of the MSF file using a text editor.

• A dividing line which contains the number of bases or residues in the sequence, when the file was created, and importantly, two dots (..) which act as a divider between the descriptive information and the following sequence information.

Page 24: Sequence file formats

• MSF files contain some other information as well:

• Name/Weight: The name of each sequence included in the alignment, as well as its length and checksum (both non-editable) and weight (editable).

• Separating Line. Must include two slashes (//) to divide the name/weight information from the sequence alignment.

Page 25: Sequence file formats

• Multiple Sequence Alignment. Each sequence named in the above Name/Weight lines is included. The alignment allows you to view the relationship among sequences.
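
The MSF layout described above (file type line, description, dividing line with "..", Name/Weight lines, "//" separator, then the alignment) can be sketched in Python as follows. This is an illustration under stated assumptions, not a full MSF reader: the function name `parse_msf` and the example file are hypothetical, the "Check:" values in the example are placeholders rather than real checksums, and the sketch does not verify lengths or checksums.

```python
def parse_msf(text):
    """Return (sequences, weights) dicts keyed by sequence name."""
    sequences = {}
    weights = {}
    in_alignment = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_alignment:
            if stripped.startswith("//"):
                # The separating line ends the Name/Weight section.
                in_alignment = True
            elif stripped.startswith("Name:"):
                fields = stripped.split()
                name = fields[1]
                sequences[name] = ""
                if "Weight:" in fields:
                    weights[name] = float(fields[fields.index("Weight:") + 1])
            continue
        fields = stripped.split()
        # Only lines that start with a declared name carry sequence data;
        # coordinate lines and blank lines are skipped.
        if len(fields) >= 2 and fields[0] in sequences:
            sequences[fields[0]] += "".join(fields[1:])
    return sequences, weights


# Hypothetical file, for illustration only; Check values are placeholders.
msf_text = """!!NA_MULTIPLE_ALIGNMENT 1.0
 Hypothetical example alignment

 example.msf  MSF: 12  Type: N  March 3, 2016  12:00  Check: 0 ..

 Name: seqA   Len: 12  Check: 0  Weight: 1.00
 Name: seqB   Len: 12  Check: 0  Weight: 1.00

//

seqA  ACGT.ACGTG GC
seqB  ACGTTACGAG GC
"""
msf_sequences, msf_weights = parse_msf(msf_text)
for name, aligned in msf_sequences.items():
    print(name, aligned, msf_weights[name])
```

The sequence lines are matched against the names collected from the Name/Weight section, so stray lines in the alignment body do not create spurious entries.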

Page 27: Sequence file formats

THANK YOU