8/12/2019 2nd Lec Student Copy_3
Introduction to Bioinformatics
2nd lecture
Muhammad Muddassir Ali
IBBT
Topics of this lecture
Course aims and learning goals
Databases (primary and secondary)
Different sequence formats
NCBI
Learning outcomes of the lecture
The student should be able to:
Define the concept of a database and its types in detail
Understand NCBI and the basic information related to it
Database: just a collection of data?
DB = structured (organized) collection of data
Must have:
Routines for adding new entries and updating old entries
Methods for handling user queries, i.e. access to the data
Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses.
They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics.
Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), clinical effects of mutations, as well as similarities of biological sequences and structures.
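The two requirements listed above (routines for adding and updating entries, and methods for handling user queries) can be illustrated with a minimal Python sketch. The class name TinySequenceDB and its method names are hypothetical, chosen only for this illustration; real biological databases are far more elaborate.

```python
from dataclasses import dataclass, field


@dataclass
class TinySequenceDB:
    """Hypothetical in-memory database keyed by accession number."""
    entries: dict = field(default_factory=dict)

    def add(self, accession, description, sequence):
        # Routine for adding a new entry; refuse to overwrite silently.
        if accession in self.entries:
            raise KeyError(f"{accession} already exists; use update()")
        self.entries[accession] = {
            "description": description,
            "sequence": sequence,
        }

    def update(self, accession, **changes):
        # Routine for updating an old entry in place.
        self.entries[accession].update(changes)

    def query(self, keyword):
        # Method for handling a user query: case-insensitive keyword
        # search over the description field.
        keyword = keyword.lower()
        return sorted(
            acc for acc, entry in self.entries.items()
            if keyword in entry["description"].lower()
        )


db = TinySequenceDB()
db.add("AB000263",
       "Homo sapiens mRNA for prepro cortistatin like peptide",
       "ACAAGATGCC")
db.update("AB000263", sequence="ACAAGATGCCATTGTCCCCC")
print(db.query("cortistatin"))
```

A plain dictionary is used here only to keep the sketch short; the point is that a database is the data plus the routines that maintain and search it.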
Types of DBs:
On the basis of format (the style in which data is stored):
Relational
Object-oriented
Flat file, i.e. all DB entries stored in text file(s)
Many biological DBs are in flat file format:
Historical reasons (that's what biologists started with)
Easy to distribute, download and store
No need for database management software
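To make the difference between a flat file and a relational database concrete, here is a small Python sketch that stores the same two entries both ways. The tab-separated layout, the table and column names, and the second accession (MADEUP01) are assumptions made for this illustration; sqlite3 from the Python standard library stands in for a relational database system.

```python
import sqlite3

# Hypothetical flat file: one entry per line, fields separated by tabs.
# "MADEUP01" is an invented accession used only for illustration.
FLAT_FILE = (
    "AB000263\tHomo sapiens\tACAAGATGCC\n"
    "MADEUP01\tSaccharomyces cerevisiae\tATGGCTAGCT\n"
)


def flat_file_lookup(text, accession):
    # Flat file: scan the text line by line until the accession matches.
    for line in text.splitlines():
        acc, organism, seq = line.split("\t")
        if acc == accession:
            return organism, seq
    return None


def relational_lookup(conn, accession):
    # Relational: the database engine answers a declarative SQL query.
    return conn.execute(
        "SELECT organism, sequence FROM entries WHERE accession = ?",
        (accession,),
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries "
    "(accession TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
)
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [line.split("\t") for line in FLAT_FILE.splitlines()],
)

print(flat_file_lookup(FLAT_FILE, "AB000263"))
print(relational_lookup(conn, "AB000263"))
```

Both lookups return the same entry. The flat file needs no database management software, which matches the reasons listed above, while the relational version delegates searching to the database engine.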
Types of DBs:
On the basis of the kind of data included:
Primary database
Secondary database
Biological experiments feed the primary database; analysis of the primary database produces the secondary database.
Examples
Primary databases
GenBank
Entrez
EMBL and DDBJ
The Sequence Retrieval System (SRS)
Swiss-Prot
UniProt
Secondary databases
PROSITE
PRINTS
Pfam
InterPro
Sequence formats
Plain sequence format
A file in plain sequence format may contain only one sequence, while most other formats accept several sequences in one file.
An example sequence in plain format is:
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGC
CACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGAC
AGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTG
ACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGC
CCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGC
ACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTT
CTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGT
CACGCAAGTTTAATTACAGACCTGAA
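Because a plain-format file holds nothing but one sequence, reading it comes down to joining the lines. The Python sketch below shows one way to do that; the function name read_plain_sequence and the restriction to the letters A, C, G, T and N are assumptions made for this illustration.

```python
def read_plain_sequence(text):
    """Join all lines of a plain-format file into one uppercase sequence."""
    # Remove every whitespace character, including line breaks
    # and stray spaces inside a line.
    sequence = "".join(text.split()).upper()
    # Assumption: a nucleotide sequence using only A, C, G, T or N.
    unexpected = set(sequence) - set("ACGTN")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return sequence


# First two lines of the plain-format example above.
example = """ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGC
CACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGAC"""
seq = read_plain_sequence(example)
print(len(seq), seq[:10])
```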
A file in EMBL format can contain several sequences.
Each sequence entry starts with an identifier line ("ID"), followed by further annotation lines.
The start of the sequence is marked by a line starting with "SQ" and the end of the sequence is marked by two slashes ("//").
An example sequence in EMBL format is:
ID AB000263 standard; RNA; 240 BP.
XX
AC AB000263;
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 240 BP;
acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg 60
ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg 120
caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc 180
aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag 240
//
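The three markers described above ("ID", "SQ" and "//") are enough to pull sequences out of an EMBL file. The following Python sketch does so; the function name parse_embl is hypothetical, and the sample text is a shortened 120-base copy of the example entry, used only to keep the block compact.

```python
def parse_embl(text):
    """Return a list of (identifier, sequence) pairs from EMBL-format text."""
    entries = []
    identifier, chunks, in_sequence = None, [], False
    for line in text.splitlines():
        if in_sequence:
            if line.startswith("//"):
                # Two slashes mark the end of the current entry.
                entries.append((identifier, "".join(chunks)))
                identifier, chunks, in_sequence = None, [], False
            else:
                # Keep letters only: drops spaces and position counters.
                chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ID"):
            # The identifier is the first token after the "ID" line code.
            identifier = line.split()[1]
        elif line.startswith("SQ"):
            # The sequence data starts on the next line.
            in_sequence = True
    return entries


# Shortened copy of the example entry (first 120 bases only).
EMBL_TEXT = """\
ID   AB000263 standard; RNA; 120 BP.
XX
AC   AB000263;
XX
SQ   Sequence 120 BP;
     acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg   60
     ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg  120
//
"""
entries = parse_embl(EMBL_TEXT)
print(entries[0][0], len(entries[0][1]))
```

Because each entry is closed by "//", the same loop handles a file with several entries without any change.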
FASTA format
The name FASTA comes from "FAST-All": it extends the earlier FAST-P (protein) and FAST-N (nucleotide) tools to all sequence types.
A file in FASTA format can contain several sequences.
Each sequence begins with a single-line description, followed by lines of sequence data.
The description line must begin with a greater-than (">") symbol in the first column.
An example sequence in FASTA format is:
>AB000263 |acc=AB000263|descr=Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.|len=368
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGG
GTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAA
GCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCT
CGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGG
GCCCCTCATAGGAGAGG
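The rule that every record starts with a ">" description line makes FASTA easy to read by program. Below is a minimal Python sketch; the function name parse_fasta and the second record (made_up_record) are assumptions introduced for this illustration, and the first record is shortened to 120 bases.

```python
def parse_fasta(text):
    """Return a dict mapping each description line to its sequence."""
    records = {}
    description = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A ">" in the first column starts a new record.
            description = line[1:]
            records[description] = []
        elif description is None:
            raise ValueError("sequence data before the first '>' line")
        else:
            records[description].append(line)
    # Join the sequence lines of each record into one string.
    return {desc: "".join(parts) for desc, parts in records.items()}


FASTA_TEXT = """\
>AB000263 |acc=AB000263|descr=Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.|len=368
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCG
CTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGG
>made_up_record hypothetical second entry
ATGC
ATGC
"""
records = parse_fasta(FASTA_TEXT)
for description, sequence in records.items():
    print(description.split()[0], len(sequence))
```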
GCG format
A file in GCG format contains exactly one sequence.
It begins with annotation lines.
The start of the sequence is marked by a line ending with two dot ("..") characters.
This line also contains the sequence identifier and the sequence length.
This format should only be used if the file was created with the GCG package.
An example sequence in GCG format is:
ID AB000263 standard; RNA; 121 BP.
XX
AC AB000263
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 121 BP;
AB000263 Length: ..
1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
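In GCG format the only structural marker is the line ending with "..", so a reader can skip everything up to that line and treat the rest as sequence. The Python sketch below follows that idea; the function name parse_gcg is hypothetical, and the sample text is a simplified, shortened entry whose "Length: 120" value was written for this illustration rather than taken from the slide.

```python
def parse_gcg(text):
    """Return (identifier, sequence) from GCG-format text."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            # The divider line starts with the sequence identifier.
            identifier = line.split()[0]
            body = lines[index + 1:]
            break
    else:
        raise ValueError("no line ending with '..' found")
    # Keep letters only: drops spaces and the leading position numbers.
    sequence = "".join(ch for row in body for ch in row if ch.isalpha())
    return identifier, sequence


# Simplified sample; the Length value is an assumption for this sketch.
GCG_TEXT = """\
ID   AB000263 standard; RNA; 120 BP.
XX
DE   Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
AB000263  Length: 120  ..
       1  acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
      61  ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
"""
identifier, sequence = parse_gcg(GCG_TEXT)
print(identifier, len(sequence))
```

Annotation lines that end with a single dot, such as the "ID" and "DE" lines, are not mistaken for the divider because the check requires two dots.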
GenBank format
A file in GenBank format can contain several sequences.
Each sequence entry starts with a line containing the word LOCUS, followed by a number of annotation lines.
The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").
An example sequence in GenBank format is:
LOCUS AB000263 181 bp mRNA linear PRI 05-FEB-1999
DEFINITION Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
ACCESSION AB000263
ORIGIN
1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
//
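The LOCUS, ORIGIN and "//" markers described above can be used to split a GenBank file into records. Here is a minimal Python sketch; the function name parse_genbank and the dictionary keys are hypothetical, and the sample text is a shortened 120-base version of the entry, written only for this illustration.

```python
def parse_genbank(text):
    """Return a list of dicts with locus, accession and sequence."""
    records, current, in_origin = [], None, False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # A LOCUS line opens a new record; the name is the second token.
            current = {"locus": line.split()[1],
                       "accession": None,
                       "sequence": ""}
            in_origin = False
        elif current is None:
            continue  # ignore anything outside a record
        elif line.startswith("ACCESSION"):
            current["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            # The sequence data starts on the next line.
            in_origin = True
        elif line.startswith("//"):
            # Two slashes close the record.
            records.append(current)
            current, in_origin = None, False
        elif in_origin:
            # Keep letters only: drops spaces and position numbers.
            current["sequence"] += "".join(
                ch for ch in line if ch.isalpha())
    return records


# Shortened sample entry (first 120 bases only).
GENBANK_TEXT = """\
LOCUS       AB000263                 120 bp    mRNA    linear   PRI 05-FEB-1999
DEFINITION  Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
ACCESSION   AB000263
ORIGIN
        1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
       61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
//
"""
records = parse_genbank(GENBANK_TEXT)
print(records[0]["locus"], records[0]["accession"], len(records[0]["sequence"]))
```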
NCBI
GenBank is hosted at NCBI (National Center for Biotechnology Information)
GenBank founded 1982 (NCBI itself was established in 1988)
Nucleotide sequences
Anyone can add new data:
BankIt: web-based submission of a single sequence
Sequin: software for larger submissions
Scientific journals require sequence submission
Locus
Unique, but may change
Shows organism
Here, SCU = S. cerevisiae
Accession number
Unique and permanent
No info, just a number
Most reliable identification
CDS
Coding sequence
Exons
Translation shown
Origin
The actual nucleotide sequence
Summary
Databases
Types of databases
Sequence formats
NCBI