BITS: Overview of important biological databases beyond sequences

Basic bioinformatics concepts, databases and tools

Module 4

Beyond the sequences

Dr. Joachim Jacob

http://www.bits.vib.be

Updated Nov 2011

http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod4-intro_H1_2011_otherRelevantData.pdf

Module 4 broadens our view

To understand life, we need not only sequences, but many other concepts

Bioinformatics is also about storing and analyzing:

− Gene information: variations, isoforms, ...

− Expression data

− 3D protein structure data

− Interaction data

− Pathways and networks

“Storing all relevant biological data”

Schematic view II

GeneA sequence annotations – gene expr – pathway – struct,...

GeneB sequence annotations – gene expr – pathway – struct,...

GeneC sequence annotations – gene expr – pathway – struct,...

analysis

Primary database

Other sequence databases

results

Additional information sources

results

The indispensable databases

Gene Ontology – structuring

KEGG – biochemical pathways

PDB – structure of proteins

IntAct – interaction data

dbSNP – database of genomic variation

Expression sources – microarray data

Gene Ontology structures the way we communicate about life

http://www.geneontology.org/teaching_resources/tutorials/2005-09_BiB-journal-tutorial_jlomax.pdf

http://www.arabidopsis.org/help/tutorials/go1.jsp

Gene translation / Protein synthesis / Protein production

Gene Ontology structures life

http://www.geneontology.org/

Agreement on standardized keywords (often referred to as 'controlled vocabularies'), describing all natural processes in a hierarchical way (ontology).

Keywords are assigned to genes based on different kinds of evidence

Keywords are ordered in a hierarchical tree-like structure ('directed acyclic graphs')

Three GO 'trees' exist, describing:

"Biological Process"

"Cellular Component"

"Molecular Function"

A gene can be given different GO terms

Example, cytochrome c:

molecular function: oxidoreductase activity,

biological process: oxidative phosphorylation and induction of cell death,

cellular component: mitochondrial matrix and mitochondrial inner membrane.

In each tree, the terms are organised in a directed acyclic graph: a network of parent and child terms (the nodes) connected by relationships (the edges). Unlike in a true tree, a term can have more than one parent.
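The parent/child structure can be sketched in a few lines of Python. This is only a toy illustration: the term names and their relationships below are simplified assumptions, not a copy of the real ontology. It shows how a term with two parents is handled when collecting all of its ancestors.

```python
# Toy GO-like graph: each term maps to its direct parents.
# Names and edges are simplified for illustration, not taken from GO itself.
PARENTS = {
    "biological_process": [],
    "metabolic process": ["biological_process"],
    "cellular process": ["biological_process"],
    "cellular metabolic process": ["metabolic process", "cellular process"],
    "oxidative phosphorylation": ["cellular metabolic process"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upwards."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


print(sorted(ancestors("oxidative phosphorylation")))
```

A gene annotated with a child term is implicitly annotated with all of that term's ancestors, which is why this upward walk is the basic operation behind most GO queries.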

Different evidence codes can assign a degree of confidence to the assignment

http://www.geneontology.org/GO.evidence.shtml

Evidence codes can be grouped by:

− Experimental (e.g. IDA – inferred from direct assay)

− Computational analysis

− Author statement

− Curator statement

− Inferred from electronic annotation (IEA)

If available, each annotation also has a reference
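As a rough sketch of how evidence codes are used in practice, the snippet below keeps only annotations whose code falls in a chosen group. The grouping lists a subset of the standard GO evidence codes; the gene names, terms and references are hypothetical placeholders.

```python
# Subset of GO evidence codes per group (not exhaustive).
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "author": {"TAS", "NAS"},
    "curator": {"IC", "ND"},
    "electronic": {"IEA"},
}

# Hypothetical annotations: (gene, GO term name, evidence code, reference).
ANNOTATIONS = [
    ("geneA", "oxidoreductase activity", "IDA", "ref-1 (placeholder)"),
    ("geneA", "oxidative phosphorylation", "IEA", "ref-2 (placeholder)"),
    ("geneB", "mitochondrial matrix", "TAS", "ref-3 (placeholder)"),
]


def filter_by_evidence(annotations, groups):
    """Keep annotations whose evidence code belongs to one of the groups."""
    allowed = set()
    for group in groups:
        allowed |= EVIDENCE_GROUPS[group]
    return [a for a in annotations if a[2] in allowed]


experimental_only = filter_by_evidence(ANNOTATIONS, ["experimental"])
print(experimental_only)
```

Restricting an analysis to experimental codes trades coverage for confidence, since electronically inferred (IEA) annotations are not reviewed by a curator.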

Gene Ontology structures all genes according to their biological significance

The GO structure and its terms can be browsed with a browser called AmiGO.

QuickGO from EBI offers some nice visualisations

There is an excellent GO wiki for all your questions

GO can be used to retrieve all genes (or gene products) related to one specific term

You can search broadly, e.g. an AmiGO search for 'diabetes' leads to the following GO term

http://amigo.geneontology.org/

AmiGO search for 'diabetes'

GO is also useful to analyze and compare different gene lists

Many tools that work with GO are listed on the GO website.

http://www.geneontology.org/GO.tools.shtml
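Many of these tools test whether a GO term occurs in a gene list more often than expected by chance. The slides do not name a specific test; the sketch below assumes the commonly used hypergeometric (one-sided) test, and all counts in the usage line are made up.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes in the sample.

    population: number of genes in the background set
    annotated:  background genes carrying the GO term
    sample:     number of genes in the gene list
    hits:       genes in the list carrying the GO term
    """
    total = comb(population, sample)
    upper = min(sample, annotated)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 40 of 1000 background genes carry the term,
# and 8 of the 50 genes in our list do.
print(enrichment_p_value(1000, 40, 50, 8))
```

When many GO terms are tested at once, the resulting p-values need a multiple-testing correction before they are interpreted.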

Some things to know about GO

For analyses, one can make use of reduced GO sets, the so-called GO slims

– GO slims are subsets of the biologically more relevant GO terms (available per species)

– GO ontologies can be downloaded in .obo format.

Not all information is captured by GO; some needs to be retrieved from other databases

• Metabolic pathways: KEGG, …

• Phenotype/diseases

• Mapping files exist, e.g. kegg2go

http://www.geneontology.org/GO.slims.shtml
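The .obo format is plain text and simple to read. The sketch below parses only the `[Term]` stanzas and the `id`, `name`, `namespace` and `is_a` tags; the sample text uses made-up `EX:` identifiers rather than real GO terms, and a real file contains more tags than are handled here.

```python
# Made-up OBO fragment; identifiers and names are placeholders.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: toy root process
namespace: biological_process

[Term]
id: EX:0000002
name: toy child process
namespace: biological_process
is_a: EX:0000001 ! toy root process
"""


def parse_obo(text):
    """Return {term id: {"name", "namespace", "is_a"}} for [Term] stanzas."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Start a new stanza; ignore anything that is not a [Term].
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if not line or current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.split("!")[0].strip()  # drop trailing "! comment"
        if key == "id":
            terms[value] = current
        elif key == "is_a":
            current["is_a"].append(value)
        elif key in ("name", "namespace"):
            current[key] = value
    return terms


terms = parse_obo(SAMPLE_OBO)
print(terms["EX:0000002"])
```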

Biological pathways databases organise genes by molecular reactions

Three important databases on biological pathways

http://www.kegg.jp/

http://www.reactome.org/ - EBI

http://metacyc.org

Proteins with enzymatic function receive an Enzyme Commission (EC) number

http://www.chem.qmul.ac.uk/iubmb/enzyme/

EC 6 Ligases

EC 5 Isomerases

EC 4 Lyases

EC 3 Hydrolases

EC 2 Transferases

EC 1 Oxidoreductases
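The first digit of an EC number gives the enzyme class listed above. A minimal sketch of that lookup follows; the function name is an illustrative choice, and only the six classes from this slide are included.

```python
# The six top-level EC classes listed on the slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_class(ec_number):
    """Return the top-level class for an EC number such as 'EC 1.1.1.1'."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    first_digit = int(text.split(".")[0])
    return EC_CLASSES[first_digit]


print(ec_class("EC 1.1.1.1"))  # alcohol dehydrogenase
print(ec_class("3.4.21.4"))    # trypsin
```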

IntAct database contains interaction information of proteins

http://www.ebi.ac.uk/intact

Three types of interactions are stored:

− Protein-protein

− Protein-DNA

− Protein-small molecule

IntAct database represents all interactions as binary: caution!
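The slide does not spell out why caution is needed. One likely reason, assumed here, is that an experiment detecting a complex of several proteins has to be expanded into pairs, and the chosen expansion model determines which pairs appear. The sketch contrasts two common models using hypothetical protein names.

```python
from itertools import combinations


def spoke_expansion(bait, preys):
    """Pair the bait with each prey; preys are not linked to each other."""
    return [(bait, prey) for prey in preys]


def matrix_expansion(members):
    """Pair every member of the complex with every other member."""
    return list(combinations(members, 2))


# Hypothetical pull-down: bait A co-purifies with B, C and D.
print(spoke_expansion("A", ["B", "C", "D"]))
print(matrix_expansion(["A", "B", "C", "D"]))
```

With four proteins the spoke model yields three pairs and the matrix model six, so a binary interaction in the database does not always mean direct physical contact between the two proteins.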

Interaction networks can be analysed on your computer using Cytoscape

Cytoscape training material on the BITS website

PDB hosts 3-dimensional structural data on molecules

PDB = Protein Data Bank

http://www.pdb.org/pdb/home/home.do

Only structures resolved through NMR and X-ray crystallography (or other accurate techniques)

Proteins, DNA, RNA, ligands

Understanding PDB data: tutorial

PDB files can be read by a lot of different tools to display the structure

Every entry in PDB has its own PDB accession number (four characters: a digit followed by three letters or digits)

The PDB file contains the 3D coordinates of every single atom in the structure, together with the variability of that position (the last two numeric columns: occupancy and temperature factor)

http://www.bits.vib.be/index.php?option=com_content&view=article&id=17203817:protein-structure-analysis-training&catid=81:training-pages&Itemid=190
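To make the layout of a PDB file concrete, the sketch below reads one ATOM record using the fixed column positions of the PDB format. The sample record is assembled by hand with illustrative values and is not taken from a real entry; the occupancy and temperature factor are the two numeric columns that follow the coordinates.

```python
# Hypothetical ATOM record built column by column (values are illustrative).
SAMPLE_ATOM = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # 7-11: atom serial number
    " "           # 12: blank
    " N  "        # 13-16: atom name
    " "           # 17: alternate location indicator
    "MET"         # 18-20: residue name
    " "           # 21: blank
    "A"           # 22: chain identifier
    "   1"        # 23-26: residue sequence number
    "    "        # 27-30: insertion code and blanks
    "  27.340"    # 31-38: x coordinate
    "  24.430"    # 39-46: y coordinate
    "   2.614"    # 47-54: z coordinate
    "  1.00"      # 55-60: occupancy
    "  9.67"      # 61-66: temperature factor
    "          "  # 67-76: blanks
    " N"          # 77-78: element symbol
)


def parse_atom(line):
    """Extract the main fields of a fixed-width ATOM record."""
    return {
        "serial": int(line[6:11]),
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


print(parse_atom(SAMPLE_ATOM))
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real files can touch without a separating space.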

Tools to visualize (and some to analyze structures) (see BITS wiki)

http://www.bits.vib.be/wiki/index.php/Protein_structure

Finding a structure for your protein sequence means searching for similarity

Homology modeling

Similarity at the sequence level, projected onto a structure

− BLAST your query against the PDB database with cblast, or at ExPASy

− PSI-BLAST can detect sequences with similar structures (twilight zone!)

− If still no success: 3D-Jury (a meta approach, including fold recognition and local structure prediction)

Similarity at the structural level: aligning structures

− VAST (structure)

− DALI (Distance mAtrix aLIgnment)

http://www.ii.uib.no/~slars/bioinfocourse/PDFs/structpred_tutorial.pdf

http://consurf.tau.ac.il/pe/protexpl/psbiores.htm

BITS training on protein structure analysis

Tools at EBI

Structural information is used to classify proteins

SCOP and CATH group proteins based on evolutionary relationships, domain architecture and structural information.

Manually curated classification of protein domains

Database cross-references in PDB entry

http://scop.mrc-lmb.cam.ac.uk/scop/

http://www.cathdb.info/

dbSNP is a public-domain archive for simple genetic polymorphisms

Single Nucleotide Polymorphism database (NCBI)

Each dbSNP entry has a code rsxx (RefSNP) or ssxx (submitted SNP)

dbSNP contains:

− single-base nucleotide substitutions (also known as single nucleotide polymorphisms or SNPs),

− small-scale multi-base deletions or insertions (also called deletion insertion polymorphisms or DIPs),

− retroposable element insertions and microsatellite repeat variations (also called short tandem repeats or STRs).

Synchronized with new genome builds

Expression data can be sequence-based or hybridisation-based

Sequence-based (ESTs, RNA-seq, SAGE)

Digital gene expression/northern

Microarray databases (hybridisation-based)

GEO: Gene Expression Omnibus (NCBI)

− Platform: GPLxxxxxxx

− Experiment: GSExxxxxx (= several samples)

− Sample: GSMxxxxxxxx

− Some experiments are curated: GDSxxxxx (online analysis possible)

ArrayExpress (EBI)
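The GEO accession prefixes listed above tell you what kind of record you are looking at. A small sketch of that lookup follows; the accession numbers in the usage lines are arbitrary examples, not references to specific records.

```python
import re

# GEO accession prefixes as listed on the slide.
GEO_TYPES = {
    "GPL": "platform",
    "GSE": "experiment (several samples)",
    "GSM": "sample",
    "GDS": "curated dataset",
}


def geo_record_type(accession):
    """Return the record type for a GEO accession such as 'GSE1234'."""
    match = re.fullmatch(r"(GPL|GSE|GSM|GDS)(\d+)", accession.strip().upper())
    if match is None:
        raise ValueError(f"not a GEO accession: {accession!r}")
    return GEO_TYPES[match.group(1)]


print(geo_record_type("GSE1234"))
print(geo_record_type("gsm5678"))
```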

Example of expression data at GEO

Example at ArrayExpress

Entrez interconnects the databases at NCBI for easy querying

UniGene: sequences grouped by gene

PopSet: sequence alignments for population studies and phylogeny

Structure: 3D structures (PDB)

Genome: genomic maps of chromosomes and plasmids

UniSTS (Sequence Tagged Sites)

PubMed: literature abstracts (MEDLINE, …)

OMIM (Online Mendelian Inheritance in Man): literature reviews

MeSH (Medical Subject Headings): keywords

Taxonomy
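Entrez can also be queried from a script. The slides do not cover this, so the following is only a sketch that assumes NCBI's E-utilities `esearch` endpoint; it builds the query URL without sending a request. The helper name and the search term are illustrative.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities search service (assumed, not from the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an esearch URL for one Entrez database and a search term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{ESEARCH}?{query}"


# The same term can be sent to different Entrez databases.
for database in ("gene", "pubmed", "structure"):
    print(esearch_url(database, "cytochrome c"))
```

Because every Entrez database is reached through the same interface, switching from genes to literature or structures only changes the `db` parameter.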

Finding relevant data

Summarizing most important links to discover everything you need ...

Protein data

InterPro (heavily integrated with EBI resources)

http://www.interpro.org

Gene data

Entrez at NCBI: 'Entrez Gene'

http://www.ncbi.nlm.nih.gov/Entrez/

EB-eye Search at EBI: excellent for cross-species searches

http://www.ebi.ac.uk/ebisearch/

Hold your horses!

Phew, where do I place this all?

Bioinformatics is all about different data, as versatile as life itself

Due to the strong cross-references between different databases, new databases and relevant info are rapidly integrated in existing databases.

You can discover them by taking time to read the entries.

New tools are emerging every day to enable you to browse all data sources...

BioGPS, all in one window!

Integrative resources are increasingly being organised on a species basis

EMAGE database of in situ gene expression in mouse

OMIM Database of diseases in man

Websites providing an interface that integrates all these data are increasingly important

Often organized on a species basis:

− TAIR

− Flybase

− Wormbase

Organizing biological data by species

By species, why?

There is one biological information resource which stays more or less unchanged per species ...
