What is Bioinformatics?

• Bioinformatics: collection and storage of biological information

• Computational biology: development of algorithms and statistical models to analyze biological data

Jobs for bioinformaticians

Databases make biological data available to scientists

• As biology has increasingly turned into a data-rich science, the need for storing and communicating large datasets has grown tremendously.

– Nucleotide, protein sequences

– Protein structure

– Expression data

– Gene/protein networks

Nucleotide Databases

• EMBL www.ebi.ac.uk/embl/

– The EMBL (European Molecular Biology Laboratory) nucleotide sequence database is maintained by the European Bioinformatics Institute (EBI) in Hinxton, Cambridge, UK.

Nucleotide Databases cont.

• GenBank: maintained by the National Center for Biotechnology Information (NCBI); searchable through Entrez, which gives access to nucleotides, proteins, annotations, etc. www.ncbi.nlm.nih.gov/Genbank/

• UniGene: a non-redundant set of gene-oriented clusters www.ncbi.nlm.nih.gov/UniGene/

Protein Databases

• SWISS-PROT: a protein sequence database that aims to provide a high level of annotation (such as the description of a protein's function, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases. www.expasy.ch/sprot/

Protein Databases cont.

• PIR http://pir.georgetown.edu/

– The Protein Information Resource (PIR) is a division of the National Biomedical Research Foundation (NBRF) in the US. It collaborates with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID). Release 67.00 (31 Dec 2000) contains 198,801 entries.

Sequence Motif Databases

• Pfam www.sanger.ac.uk/Software/Pfam/

– Pfam is a database of protein families defined as domains (contiguous segments of entire protein sequences). For each domain, it contains a multiple alignment of a set of defining sequences (the seeds) and the other sequences in SWISS-PROT that can be matched to that alignment.

3D-Structure Databases

• PDB www.rcsb.org/pdb/

– The PDB is the main primary database for 3D structures of biological macromolecules determined by X-ray crystallography and NMR. Structural biologists usually deposit their structures in the PDB on publication, and some scientific journals require this before accepting a paper. It also accepts the experimental data used to determine the structures.

How to get sequences?

• Entrez Database provides nucleotide and protein sequences in different formats.

• One of the formats is FASTA
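As a rough illustration of retrieving a sequence from Entrez in FASTA format, the sketch below builds a request URL for NCBI's E-utilities `efetch` service. It assumes the standard `efetch.fcgi` endpoint with its `db`, `id`, `rettype`, and `retmode` parameters; the function names and the accession string are hypothetical placeholders, not part of the slides.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, db: str = "protein") -> str:
    """Build a URL asking Entrez for one record in FASTA format."""
    params = {
        "db": db,            # "protein" or "nucleotide"
        "id": accession,     # accession or identifier of the record
        "rettype": "fasta",  # ask for FASTA rather than the full record
        "retmode": "text",   # plain text response
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accession: str, db: str = "protein") -> str:
    """Download the record (needs network access; not called here)."""
    with urlopen(build_efetch_url(accession, db)) as response:
        return response.read().decode("utf-8")


# "EXAMPLE_ACCESSION" is a placeholder; substitute a real accession.
url = build_efetch_url("EXAMPLE_ACCESSION")
print(url)
```

Only the URL is built here; calling `fetch_fasta` with a real accession would perform the download.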

FASTA FORMAT

• Each sequence begins with a description line that starts with '>'

A protein in FASTA format

>HBA_ALLMI

VLSMEDKSNVKAIWGKASGHLEEYGAEALEMFCAYPQTKIYFPHFDMSHNSAQIRAHGKKVFSALHEAVNHIDDLPGALCRLSELHAHSLRVDPVNFKFLAHCVLVVFAIHHPSALSPEIHASLDKFLCAVSAVLTSKYR

• The first line is the description line. The leading '>' character marks it as the description line of the sequence that follows. The string following the '>' and ending at the first space (' ') is the sequence id (HBA_ALLMI).

A DNA sequence in FASTA format

>X sequence

ATGAATAGCACAGAGAGACCAAGAGAGAGAGAGAGACCCAGATATATCAGATAGAGA
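The FASTA rules described above (a '>' description line, an id that ends at the first space, then the sequence) can be sketched as a small parser. This is a minimal illustration, not a full FASTA implementation; the function name `parse_fasta` is an assumption made for this example.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # The id is the text after '>' up to the first whitespace.
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence data may span several lines; join them.
            records[current_id] += line
    return records


# The DNA example from the slide: id "X", description "sequence".
example = """>X sequence
ATGAATAGCACAGAGAGACCAAGAGAGAGAGAGAGACCCAGATATATCAGATAGAGA
"""
records = parse_fasta(example.splitlines())
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```

Each record is stored under its id only; the rest of the description line is discarded in this sketch.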

Why align sequences?

• Find evolutionary relationships between species and/or genes.

• Identify novel genes and define similar genes in other species.

• Study genomes and how they change.

Sequence Alignment

• Homology means that two (or more) sequences have a common ancestor.

• An example of a sequence alignment

Sequence 1

Sequence 2
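To make the idea of aligning two sequences concrete, here is a minimal sketch of pairwise global alignment using the Needleman-Wunsch dynamic programming method. This is a simpler technique than the multiple alignment performed by CLUSTALW, and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_1, aligned_2, total = needleman_wunsch("ACGT", "AGT")
print(aligned_1)
print(aligned_2)
print(total)
```

Real alignment tools use substitution matrices and more refined gap penalties; the fixed scores here only keep the example short.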

CLUSTALW: software for aligning sequences

http://www.ebi.ac.uk/clustalw/

Genome Databases

• www.ensembl.org

Genome Databases: Gene Prediction

• Define the location of genes (coding sequences, regulatory regions)

• Gene prediction using software based on rules and patterns: find Open Reading Frames (ORFs), with additional criteria for a good start sequence for a gene.

• Gene identification through alignment with known proteins and EST sequences (Expressed Sequence Tags; mRNA sequences).

• Gene prediction through similarity with proteins or ESTs in other organisms.

• Gene prediction through comparison with other genomes; conserved regions are probably coding or regulatory regions.
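The rule-based approach mentioned above, finding Open Reading Frames, can be sketched as a scan for a start codon followed in the same frame by a stop codon. This is a simplified illustration under stated assumptions: it uses ATG as the only start codon, the standard stop codons TAA, TAG, and TGA, searches the forward strand only, and applies none of the additional start-sequence criteria. The function name and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the index of the ATG, end is one past the stop codon,
    and min_codons counts codons from the ATG up to (not including)
    the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


sequence = "CCATGAAATTTTAGGG"
orfs = find_orfs(sequence)
for start, end, frame in orfs:
    print(frame, sequence[start:end])
```

A real gene finder would also scan the reverse complement and weigh further evidence such as similarity to known proteins or ESTs, as the other bullets describe.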

Genome Databases: Annotation

• Annotation of the genes: Compare with genes/proteins of known function in other organisms.

• Functional classification. Broad groups of functional characterization, such as 'ribosomal proteins', 'nucleotide metabolism', 'signal transduction'.

Genome Databases: Evolution

• Evolutionary history

• Genome duplications

• Gene loss

Transcription Databases

• Microarrays can analyze thousands of transcripts simultaneously.

– This allows analysis of genes whose expression is high or low in, for example, normal versus disease samples.

• Microarray databases contain large amounts of expression data.

– Stanford Microarray Database:

Signaling & Metabolic Pathways

• Analyze how genes/proteins interact and learn about the function of genes

– KEGG: Kyoto Encyclopedia of Genes and Genomes

– http://www.genome.ad.jp/kegg/
