Databases

Protein Structure and Bioinformatics Group

7 Oct 2016 2

Purpose of the lecture

● provide an overview of available databases
● what are they for?
● the contents of the most important databases
● how to query these databases
● make you aware of drawbacks and pitfalls

Overview

● intro on databases
● database models
● overview of biological databases
● details of often used databases and/or providers
● some remarks on data quality

Why databases?

● Exponential growth of:
– sequences
– structures
– literature
● Need for efficient storage and management tools
● Need for standardization

Solution: databases

● coherent, consistent, designed for special purpose
● data model: clearly defined data structure
● database management system: easy access and management

What is a database

● any organized collection of data
– card filing system
– telephone book

● now: A collection of information organized in such a way that a computer program can quickly select desired pieces of data.

● you need: Database Management System (DBMS)

Database models: logical structure of a database

● flat file
● relational model (most used)
● other:
– object-oriented, XML, hierarchical, network
● Database Management Systems (DBMS) include: MySQL, PostgreSQL, SQLite, Microsoft SQL Server, Oracle, SAP, dBASE, FoxPro, IBM DB2, LibreOffice Base and FileMaker Pro

Flat file

● written in plain text, in a standard, defined format
● often tab-delimited or comma-separated text files
● each line is a record
● fields are separated by delimiters: tabs, commas
● searching is sequential only
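
A minimal sketch of what sequential searching in a flat file looks like, assuming a tab-delimited file with a header line. The column names and records are invented for illustration; the point is that every record is read in turn, because a flat file has no index to jump to.

```python
import csv
import io

# Hypothetical tab-delimited flat file: one record per line, fields separated by tabs.
FLAT_FILE = (
    "accession\tgene\torganism\n"
    "ACC0001\tgeneA\tHomo sapiens\n"
    "ACC0002\tgeneB\tMus musculus\n"
    "ACC0003\tgeneC\tHomo sapiens\n"
)


def find_records(handle, field, value):
    """Scan the file record by record and return the rows where field == value."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row[field] == value]


# io.StringIO stands in for an open file so the example is self-contained.
matches = find_records(io.StringIO(FLAT_FILE), "organism", "Homo sapiens")
print([row["accession"] for row in matches])
```

With a real file, `io.StringIO(FLAT_FILE)` would be replaced by `open(path, newline="")`.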

DNA and protein sequences in FASTA format

>gi|71902539|ref|NM_000051.3| Homo sapiens ataxia telangiectasia mutated (ATM), mRNA
CCGGAGCCCGAGCCGAAGGGCGAGCCGCAAACGCTAAGTCGCTGGCCATTGGTGGACATGGCGCAGGCGCGTTTGCTCCGACGGGCCGAATGTTTTGGGGCAGTGTTTTGAGCGCGGAGACCGCGTGATACTGGATGCGCATGGGCATACCGTGCTCTGCGGCTGCTTGGCGTTGCTTCTTCCTCCAGAAGTGGGCGCTGGGCAGTCACGCAGGGTTTGAACCGGAAGCGGGAGTAGGTAGCTGCGTGGCTAACGGAGAAAAGAAGCCGTGGCCGCGGGAGGAGGCGAGAGGAGTCGGGATCTGCGCTGCAGCCACCGCCGCGGTTGATACTACTTTGACCTTCCGAGTGCAGTGACAGTGATGTGTGTTCTGAAATTGTGAACCATGAGTCTAGTACTTAATGATCTGCTTATCTGCTGCCGTCAACTAGAACATGATAGAGCTACAGAACGAAAGAAAGAAGTTGAGAAATTTAAGCGCCTGATTCGAGATCCTGAAACAATTAAACATCTAGATCGGCATTCAGATTCCAAACAAGGAAAATATTTGAATTGGGATG

>gi|71902540|ref|NP_000042.3| serine-protein kinase ATM [Homo sapiens]
MSLVLNDLLICCRQLEHDRATERKKEVEKFKRLIRDPETIKHLDRHSDSKQGKYLNWDAVFRFLQKYIQKETECLRIAKPNVSASTQASRQKKMQEISSLVKYFIKCANRRAPRLKCQELLNYIMDTVKDSSNGAIYGADCSNILLKDILSVRKYWCEISQQQWLELFSVYFRLYLKPSQDVHRVLVARIIHAVTKGCCSQTDGLNSKFLDFFSKAIQCARQEKSSSGLNHILAALTIFLKTLAVNFRIRVCELGDEILPTLLYIWTQHRLNDSLKEVIIELFQLQIYIHHPKGAKTQEKGAYESTKWRSILYNLYDLLVNEISHIGSRGKYSSGFRNIAVKENLIELMADICHQVFNEDTRSLEISQSYTTTQRESSDYSVPCKRKKIELGWEVIKDHLQKSQNDFDLVPWLQIATQLISKYPASLPNCELSPLLMILSQLLPQQRHGERTPYVLRCLTEVALCQDKRSNLESSQKSDLLKLWNKIWCI
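
FASTA is itself a flat-file format: a header line starting with `>` followed by one or more sequence lines. The sketch below shows one way to read such records in plain Python. The function name `parse_fasta` is hypothetical, and the example input is only the first 60 residues of the ATM protein record shown above, wrapped over two lines for illustration.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)


# Truncated example record (first 60 residues only).
example = """>gi|71902540|ref|NP_000042.3| serine-protein kinase ATM [Homo sapiens]
MSLVLNDLLICCRQLEHDRATERKKEVEKF
KRLIRDPETIKHLDRHSDSKQGKYLNWDAV
""".splitlines()

records = list(parse_fasta(example))
for header, sequence in records:
    print(header.split("|")[3], len(sequence))
```

For real work a maintained parser (for instance the one in Biopython) is usually the better choice; this sketch only illustrates the structure of the format.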

Relational database

● database is composed of tables
● each table has records (rows)
● each record has fields (columns)
● relational:
– tables hold logically related sets of data
– each record has a unique identifier: primary key
– relations between tables through keys

Relational database

● PK = primary key, unique identifier

● FK = foreign key, connects to primary key in Customer table
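
The primary key / foreign key relation can be sketched with Python's built-in `sqlite3` module. Only the `Customer` table is named on the slide; the `Orders` table, all column names and the example rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# PK: customer_id uniquely identifies each customer.
conn.execute(
    "CREATE TABLE Customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
# FK: Orders.customer_id must match an existing Customer.customer_id.
conn.execute(
    "CREATE TABLE Orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES Customer(customer_id),"
    " item TEXT NOT NULL)"
)
conn.execute("INSERT INTO Customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO Orders VALUES (10, 1, 'pipettes')")

# An order that points at a non-existent customer violates the foreign key.
try:
    conn.execute("INSERT INTO Orders VALUES (11, 99, 'gloves')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# The keys are what allows the two tables to be combined in a query.
rows = conn.execute(
    "SELECT Customer.name, Orders.item FROM Orders"
    " JOIN Customer ON Orders.customer_id = Customer.customer_id"
).fetchall()
print(fk_rejected, rows)
```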

Aspects of relational databases

● tables hold logically related sets of data
● order of rows irrelevant (random access!)
● rows are unique: no duplication of information
● searching is specifying what you want:
– which field(s) from which table(s) under which condition(s)
– SQL (Structured Query Language)
● searching speed can be increased by using indexes
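
A small sketch of "which fields from which table under which conditions", again using Python's built-in `sqlite3` module. The `protein` table, its columns and its rows are hypothetical; the index on `organism` is the kind of structure that lets the DBMS avoid scanning every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein ("
    " accession TEXT PRIMARY KEY, gene TEXT, organism TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("ACC0001", "geneA", "Homo sapiens", 350),
        ("ACC0002", "geneB", "Mus musculus", 420),
        ("ACC0003", "geneC", "Homo sapiens", 120),
        ("ACC0004", "geneD", "Homo sapiens", 980),
    ],
)

# An index on the column used in the condition can speed up the search.
conn.execute("CREATE INDEX idx_protein_organism ON protein (organism)")

# SELECT = which fields, FROM = which table, WHERE = which conditions.
rows = conn.execute(
    "SELECT accession, gene FROM protein"
    " WHERE organism = ? AND length > ?"
    " ORDER BY accession",
    ("Homo sapiens", 300),
).fetchall()
print(rows)
```

The query states only what is wanted; how the rows are located (index lookup or full scan) is left to the DBMS.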

Querying a database with SQL

How to access databases

● Web-based Graphical User Interfaces (GUI)
– you do not see the underlying database structure
– output defined by host/provider
● File Transfer Protocol (FTP)
– mostly flat files
● Application Programming Interface (API)
– you access the database programmatically through web services (SOAP/REST)
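
As an illustration of programmatic access through a REST web service, the sketch below builds a search request for the NCBI E-utilities `esearch` endpoint. E-utilities is not named on the slide and is used here only as an example; the helper names `build_esearch_url` and `search_ids` are hypothetical. Only the URL is constructed when the code runs; `search_ids` performs the actual HTTP request and needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=5):
    """Return the request URL for a search of `term` in the NCBI database `db`."""
    query = urlencode({"db": db, "term": term, "retmode": "json", "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"


def search_ids(db, term, retmax=5):
    """Run the search over HTTP and return the list of matching record IDs."""
    with urlopen(build_esearch_url(db, term, retmax), timeout=30) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]


# Building the URL needs no network; calling search_ids(...) does.
url = build_esearch_url("gene", "ATM[gene] AND Homo sapiens[orgn]")
print(url)
```

Unlike the web interface, the response is structured data (here JSON), so the result can be processed directly by a program.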

Biological database providers/hosts

● EBI European Bioinformatics Institute

● SIB Swiss Institute of Bioinformatics

● NCBI National Center for Biotechnology Information

● DDBJ DNA Data Bank of Japan

Classification of biological databases

Primary: hold experimentally derived data

● experimental data repositories
● sequence databases
● structure databases

Classification of biological databases

Secondary: derived information from primary databases

● sequence related
● genome related
● structure related
● expression data (RNA, protein)
● pathway information

Experimental data repositories

● Gene Expression Omnibus (GEO)
● ArrayExpress
● European Nucleotide Archive (ENA)

Primary sequence databases

DNA/nucleotide sequences

Ensembl (EBI/Wellcome Trust Sanger Inst.)

GenBank (NCBI)

DNA Data Bank of Japan (DDBJ)

European Nucleotide Archive (EMBL-EBI)

Primary sequence databases

protein sequences

UniProtKB (UniProt Knowledgebase)
– UniProtKB/Swiss-Prot

– UniProtKB/TrEMBL

NCBI Protein

Primary structure databases

Protein Data Bank (PDB)

Nucleic Acid Database

Cambridge Structural Database

Secondary databases

● sequence related

– ProSite

– Pfam

– Enzyme

– REBase (restriction enzymes)

Secondary databases

● genome related

Online Mendelian Inheritance in Man

TRANSFAC (transcription factors)

Secondary databases

● structure related

– DSSP Database of Secondary Structure Assignments

– HSSP Homology-derived Secondary Structure of Proteins

– Dali: comparing protein structures in 3D

Secondary databases

● expression data
– Expression Atlas
– Human Protein Atlas
● pathway related

– KEGG: Kyoto Encyclopedia of Genes and Genomes

Databases on Human Genes and Diseases

● General human genetics databases

e.g. HGMD

● General polymorphism databases

e.g. NCBI SNP (dbSNP)

● Cancer gene and variant databases

e.g. COSMIC, Cancer Genome Atlas

Databases on Human Genes and Diseases

● Gene-, system- or disease-specific databases
– Locus-Specific DataBases, see e.g. HGVS http://www.hgvs.org

– Disease-specific, e.g. IDbases: locus-specific databases for immunodeficiency-causing variations http://structure.bmc.lu.se/idbase/

– System-specific, e.g. GWAS Catalog: genome-wide association studies

Databases on Human Genes and Diseases

● Online Mendelian Inheritance in Man

Locus-Specific Databases (LSDBs) list at www.hgvs.org/locus-specific-mutation-databases

IDbases at structure.bmc.lu.se/idbase

BTKbase at LOVD.nl

Nucleic Acids Research

● The NAR online Molecular Biology Database Collection is published in the Database issue each year

● 2016: 1685 listings
● URL: http://www.oxfordjournals.org/nar/database/c/

Wikipedia

URL: http://en.wikipedia.org/wiki/List_of_biological_databases

PubMed

● The access point to medicine-related publications
● PubMed comprises more than 26 million citations for biomedical literature

URL: http://www.ncbi.nlm.nih.gov/pubmed

Some examples

● NCBI
● UniProtKB/Swiss-Prot
● PDB
● Ensembl

NCBI: https://www.ncbi.nlm.nih.gov/

NCBI Genetics & Medicine

NCBI Handbook

NCBI search

NCBI Gene: download settings

NCBI Gene: display settings

NCBI Gene: Genomic regions etc.

NCBI Gene: Reference sequences

information about the fields in GenBank records can be found at:

● NCBI handbook
● https://www.ncbi.nlm.nih.gov/genbank/samplerecord/

NCBI dbSNP: short genetic variations

UniProt: www.uniprot.org

UniProtKB/Swiss-Prot

Protein Data Bank in Europe (PDBe)

Protein Data Bank (in Japan)

Protein Data Bank

Ensembl: www.ensembl.org

Ensembl variants

KEGG: integrating genomic and chemical information with systems information

KEGG Pathways

Some remarks about data quality

● how up-to-date is the database?
● is the database hand-curated by experts?
● when using data from a database, try to check these points
● be aware that there can always be errors somewhere

Example of checking data

● checking variant descriptions can be done with the Mutalyzer Name Checker tool: https://mutalyzer.nl

● Name Checker takes a complete sequence variant description (e.g. NM_000061.2:c.214A>G)

● the variant description is checked against the HGVS nomenclature rules
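
To make the structure of such a description concrete, here is a deliberately narrow sketch that splits a coding-DNA substitution like the example above into its parts. It is not how Mutalyzer works: it only tests the syntax of one variant type (a single-nucleotide substitution on a versioned RefSeq-style accession), and it does not check the reference base against the actual sequence. The pattern and the function name `parse_substitution` are assumptions made for illustration.

```python
import re

# Narrow pattern: accession.version, ":c.", position, reference base ">" new base.
SUBSTITUTION = re.compile(
    r"^(?P<reference>[A-Z]{2}_\d+\.\d+)"
    r":c\.(?P<position>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_substitution(description):
    """Return the parts of a simple c. substitution, or None if it does not match."""
    match = SUBSTITUTION.match(description)
    if match is None:
        return None
    fields = match.groupdict()
    fields["position"] = int(fields["position"])
    return fields


parsed = parse_substitution("NM_000061.2:c.214A>G")
print(parsed)
print(parse_substitution("NM_000061.2:c.214A>"))  # incomplete description
```

A syntactically valid description can still be wrong, for example when the stated reference base does not occur at that position, which is why a tool that compares the description with the reference sequence is needed.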

Mutalyzer Name Checker

Mutalyzer Name Check result (part)

Thanks

● Protein Structure and Bioinformatics Group
● BMC B13
● gerard.schaafsma@med.lu.se
● http://structure.bmc.lu.se
