PharmaMatrix Workshop 2010 Bioinformatic Databases
14 July 2010 Philip Winter & Ishwar Hosamani
Database Growth
Source: http://www.kokocinski.net/bioinformatics/databases.php
Genes & Proteins
Gene & Protein Interactions
Cheminformatics: Drugs & Metabolites
Database Survey
• Genes & Proteins: UniProt, GenBank, dbSNP, PDB, GEO, Pfam, TGI
• Gene & Protein Interactions: BioGRID, KEGG, NetPath
• Cheminformatics (Drugs & Metabolites): ZINC, DrugBank, PubChem, SciFinder
Curation
• Manual curation (or just curation): A human creates and annotates the database entry
• Automatic curation: A computer program creates and annotates the database entry
• Semi-automatic curation: A combination of manual and automatic curation
Database Identifiers
• Every database record has a unique identifier; often this is called an accession number, which is assigned when the record is first added to the database
• Be careful: databases will often permit a record to be modified while keeping the same accession number, so you should record the version number as well
• Furthermore, databases may have different rules for handling records that are merged or split
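As a minimal sketch of the advice to record the version number, the snippet below splits a GenBank/RefSeq-style `ACCESSION.VERSION` string into its two parts. The names `VersionedAccession` and `split_version` are hypothetical helpers introduced for illustration, and the sketch assumes the version is a numeric suffix after the last dot.

```python
from typing import NamedTuple, Optional


class VersionedAccession(NamedTuple):
    accession: str
    version: Optional[int]  # None when no numeric ".N" suffix is present


def split_version(identifier: str) -> VersionedAccession:
    """Split 'ACCESSION.VERSION' (e.g. 'AB088100.1') into its two parts."""
    text = identifier.strip()
    accession, dot, suffix = text.rpartition(".")
    if dot and accession and suffix.isdigit():
        return VersionedAccession(accession, int(suffix))
    # No numeric suffix: treat the whole string as an unversioned accession.
    return VersionedAccession(text, None)


if __name__ == "__main__":
    for example in ("AB088100.1", "NM_178014.2", "NM_178014"):
        print(example, "->", split_version(example))
```

Storing both fields lets you notice when two people cite the same accession but different versions of the record.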
Database Identifier Cheat Sheets
Pattern | Identifier Name | Examples | Entity | Database | URL
[optional "GI:"][digits] | GenInfo Identifier | GI:34222261 | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/
[letter][5 digits] OR [2 letters][6 digits] | GenBank ACCESSION | AB088100 | Nucleotide sequence | GenBank | http://www.ncbi.nlm.nih.gov/
[2 letter type code]_[digits] | RefSeq ACCESSION | NM_178014 | Nucleotide or protein sequence | RefSeq | http://www.ncbi.nlm.nih.gov/
[GenBank or RefSeq ACCESSION].[version number] | GenBank or RefSeq VERSION | AB088100.1, NM_178014.2 | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/
(identical to accession for recent entries) | GenBank LOCUS | | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/
Pattern | Identifier Name | Examples | Entity | Database | URL
[Protein code]_[Species code] | Swiss-Prot ID (entry name) | TBB5_HUMAN | Protein sequence | UniProtKB/Swiss-Prot | http://www.uniprot.org/
[UniProt AC]_[Species code] | UniProt ID (entry name) | Q9BUU9_HUMAN | Protein sequence | UniProtKB/TrEMBL | http://www.uniprot.org/
[A-N,R-Z][0-9][A-Z][A-Z, 0-9][A-Z, 0-9][0-9] OR [O,P,Q][0-9][A-Z, 0-9][A-Z, 0-9][A-Z, 0-9][0-9] | UniProt AC (accession number) | P07437 | Protein sequence | UniProtKB | http://www.uniprot.org/
Pattern | Identifier Name | Examples | Entity | Database | URL
[capital letters or digits; no initial digit] | HGNC gene symbol | TUBB, TUBB1 | Human gene | HGNC database | http://www.genenames.org/
GO:[7 digits] | GO accession number | GO:0005874 | Gene class | AmiGO | http://www.geneontology.org/
[0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9] | PDB ID | 1TUB | Protein, nucleic acid, or complex structure | PDB | http://www.rcsb.org/
[2 or 3 letters or digits] | PDB ligand ID | CN2 | Ligand | PDB | http://www.rcsb.org/
Pattern | Identifier Name | Examples | Entity | Database | URL
[up to 7 digits]-[2 digits]-[1 digit] | CAS registry number | 64-86-8 | Chemical structure | SciFinder | https://scifinder-cas-org.login.ezproxy.library.ualberta.ca/
[digits] | PubChem CID (compound ID) | 6167 | Chemical structure | PubChem | http://pubchem.ncbi.nlm.nih.gov/
ZINC[8 digits] OR [digits] | ZINC ID | ZINC00621853, 621853 | Chemical structure | ZINC | http://zinc.docking.org/
DB[5 digits] | DrugBank accession number | DB01394 | Drug (chemical structure) | DrugBank | http://www.drugbank.ca/
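The cheat-sheet patterns translate fairly directly into regular expressions. The sketch below is one possible transcription, not an official validator for any of these databases; the function name `candidates` and the exact regexes are assumptions made for illustration. Because several patterns overlap (a bare number could be a GI, a PubChem CID, or a ZINC ID), the function returns every identifier type that matches rather than a single guess.

```python
import re
from typing import List

# Regexes transcribed from the cheat sheets above; several overlap by design.
PATTERNS = [
    ("GenInfo Identifier", r"(GI:)?\d+"),
    ("GenBank ACCESSION", r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    ("RefSeq ACCESSION", r"[A-Z]{2}_\d+"),
    ("GenBank or RefSeq VERSION",
     r"([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}_\d+)\.\d+"),
    ("UniProt entry name", r"[A-Z0-9]+_[A-Z0-9]+"),
    ("UniProt AC",
     r"[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9]|[OPQ][0-9][A-Z0-9]{3}[0-9]"),
    ("HGNC gene symbol", r"[A-Z][A-Z0-9]*"),
    ("GO accession number", r"GO:\d{7}"),
    ("PDB ID", r"[0-9][A-Z0-9]{3}"),
    ("PDB ligand ID", r"[A-Z0-9]{2,3}"),
    ("CAS registry number", r"\d{1,7}-\d{2}-\d"),
    ("PubChem CID", r"\d+"),
    ("ZINC ID", r"ZINC\d{8}|\d+"),
    ("DrugBank accession number", r"DB\d{5}"),
]


def candidates(identifier: str) -> List[str]:
    """Return the names of all cheat-sheet patterns the identifier matches."""
    return [name for name, pattern in PATTERNS
            if re.fullmatch(pattern, identifier)]


if __name__ == "__main__":
    for example in ("GI:34222261", "AB088100", "P07437", "1TUB", "64-86-8"):
        print(example, "->", candidates(example))
```

A pattern match only narrows the possibilities: `P07437`, for instance, fits both the UniProt AC pattern and the one-letter GenBank pattern, so the final check is still a lookup in the database itself.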
Key File Formats for Sequences and Structures
• Sequences – FASTA format: .fasta .fst .txt
• Macromolecule structures – PDB format: .pdb .ent
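For orientation, a FASTA file is a series of records, each a header line starting with `>` followed by one or more sequence lines. The reader below is a minimal sketch under that assumption; the function name `parse_fasta` and the sample record are made up for illustration, and a maintained library such as Biopython would be the more robust choice for real work.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    parts: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            parts[header] = []
        elif header is not None:
            parts[header].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


if __name__ == "__main__":
    # Hypothetical two-line record, not taken from any database.
    sample = ">demo_seq hypothetical example\nACGT\nTTGA\n"
    print(parse_fasta(sample.splitlines()))
```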
Accessing Databases
• Web interface
• Query string, e.g. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=34222261&rettype=fasta&retmode=fasta
• Web services (SOAP)
• FTP -> local copy
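The query-string approach can be scripted by assembling the URL from its parameters. The sketch below only builds the efetch URL shown above and does not contact NCBI; the helper name `efetch_url` is hypothetical, and the parameter values are copied from the slide rather than checked against current E-utilities documentation.

```python
from urllib.parse import urlencode

# Base URL as given on the slide.
EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db: str, record_id: str,
               rettype: str = "fasta", retmode: str = "fasta") -> str:
    """Build an efetch query string for one record."""
    query = urlencode({
        "db": db,
        "id": record_id,
        "rettype": rettype,
        "retmode": retmode,
    })
    return f"{EFETCH}?{query}"


if __name__ == "__main__":
    print(efetch_url("nucleotide", "34222261"))
```

The resulting string could then be passed to `urllib.request.urlopen` to download the record; that step is left out here so the example runs without network access.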
Cheminformatic Database Survey
• CAS SciFinder (https://scifinder-cas-org.login.ezproxy.library.ualberta.ca/)
  – >52 million organic compounds, >61 million inorganic compounds
  – Physical property info
• PubChem (http://pubchem.ncbi.nlm.nih.gov/)
  – >27 million unique structures, >23 million with 3D conformations
  – Mostly organic, biologically interesting compounds
• DrugBank (http://www.drugbank.ca/)
  – ~4,800 drugs, >1,350 FDA approved drugs
  – Includes drug target info
• ZINC (http://zinc.docking.org/)
  – >13 million purchasable compounds
  – Ready to dock
Stereochemistry Issues
5 Chaetocin structures from PubChem:
• CID 161591: no stereochemistry
• CID 5390098: bad stereochemistry
• CID 11657687: natural product stereochemistry
• CID 46191942: enantiomer of natural product
• CID 11563851: incomplete stereochemistry
Other Cheminformatic Issues
• Tautomers / protonation states?
• Salt forms?
• Implicit or explicit hydrogens?
• 2D connectivity only or 3D conformation?
• Non-organic elements?
  – Many programs only handle: CHNOPS + halogens
  – But some drugs have B, Pt, Hg, As, …
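One way to catch the non-organic element problem early is to scan a structure's SMILES string for elements outside CHNOPS + halogens before handing it to another program. The sketch below uses a simplified regular-expression tokenizer, not a full SMILES parser; the names `elements_in_smiles` and `unsupported_elements` are hypothetical, and the platinum and boron inputs are simplified illustrative strings rather than curated database entries.

```python
import re
from typing import Set

# Elements that, per the slide, most programs handle: CHNOPS + halogens.
SUPPORTED = {"C", "H", "N", "O", "P", "S", "F", "Cl", "Br", "I"}

# Either a bracket atom such as [Pt], [13C], [C@H], [nH] (group 1 = element),
# or an unbracketed organic-subset atom (group 2). Bonds, ring-closure digits
# and parentheses are skipped. Hydrogen counts inside brackets are not reported.
ATOM = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"
    r"|(Cl|Br|[BCNOPSFI]|[bcnops])"
)


def elements_in_smiles(smiles: str) -> Set[str]:
    """Return the set of heavy-atom element symbols found in a SMILES string."""
    found = set()
    for match in ATOM.finditer(smiles):
        symbol = match.group(1) or match.group(2)
        found.add(symbol.capitalize())  # aromatic 'c' -> 'C', 'se' -> 'Se'
    return found


def unsupported_elements(smiles: str) -> Set[str]:
    """Elements present in the SMILES but outside CHNOPS + halogens."""
    return elements_in_smiles(smiles) - SUPPORTED


if __name__ == "__main__":
    for text in ("N[Pt](N)(Cl)Cl", "OB(O)c1ccccc1", "CCO"):
        print(text, "->", sorted(unsupported_elements(text)))
```

An empty result means the structure stays within the commonly supported element set; anything else is a prompt to check how the downstream tool treats that element.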
SMILES
CC(=O)N[C@H]1CCC2=CC(=C(C(=C2C3=CC=C(C(=O)C=C13)OC)OC)OC)OC
• Isomeric SMILES
  – Allows specification of stereochemistry
• Canonical SMILES
  – Canonicalization will generate a unique string for a molecule, regardless of atom order
  – Different programs will canonicalize differently
• SMARTS
  – Chemical patterns for searching or filtering
http://www.daylight.com/smiles/index.html
File formats
• MDL Molfile .mol – Allows a 3D conformation to be stored
• SDF .sdf – Wraps Molfile format; multiple structures; annotations
• PDB .pdb .ent – Not the best for small molecules
Need to convert? -> Try OpenBabel http://openbabel.org/wiki/Main_Page
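To make the "wraps Molfile; multiple structures; annotations" description concrete, the sketch below splits SDF text into records at the `$$$$` delimiter and collects the `> <FIELD>` data items that follow each Molfile block. It is a simplified reader under those assumptions and does not interpret the atom or bond tables; the function names and the two sample records (including the `EXAMPLE_ID` field) are hypothetical.

```python
import re
from typing import Dict, List

DATA_HEADER = re.compile(r"^>\s*<([^>]+)>")


def _parse_record(lines: List[str]) -> Dict[str, object]:
    """Split one SDF record into title, Molfile text, and data items."""
    molfile: List[str] = []
    data: Dict[str, List[str]] = {}
    field = None
    in_molfile = True
    for line in lines:
        if in_molfile:
            molfile.append(line)
            if line.startswith("M  END"):
                in_molfile = False  # annotations start after the Molfile
            continue
        match = DATA_HEADER.match(line)
        if match:
            field = match.group(1)
            data[field] = []
        elif field is not None and line.strip():
            data[field].append(line.strip())
        else:
            field = None  # a blank line ends the current data item
    return {
        "title": lines[0].strip() if lines else "",
        "molfile": "\n".join(molfile),
        "data": {name: "\n".join(values) for name, values in data.items()},
    }


def parse_sdf(text: str) -> List[Dict[str, object]]:
    """Return one dict per record; records end with a '$$$$' line."""
    records, block = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(_parse_record(block))
            block = []
        else:
            block.append(line)
    return records


# Two made-up single-atom records, for illustration only.
SAMPLE = "\n".join([
    "demo-1", "  hypothetical", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END", "> <EXAMPLE_ID>", "1", "", "$$$$",
    "demo-2", "  hypothetical", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END", "> <EXAMPLE_ID>", "2", "", "$$$$",
])

if __name__ == "__main__":
    for record in parse_sdf(SAMPLE):
        print(record["title"], record["data"])
```

For conversion between formats, a dedicated tool such as OpenBabel remains the better option; this sketch is only meant to show how the file is laid out.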
Pathway and Interaction Databases
• KEGG Pathways (http://www.genome.jp/kegg/)
  – Manually drawn pathways of metabolism, signaling, and other biological processes
  – >300 pathways + organism-specific versions
• NetPath (http://www.netpath.org/)
  – Curated protein signal pathways in humans
  – 20 pathways, 1,800 interactions
• BioGRID (http://thebiogrid.org/)
  – A repository for protein and gene interaction data
  – 345,620 interactions
Pathway Formats
• SBML .xml – The Systems Biology Markup Language
http://sbml.org/Main_Page
• Also check out the BioPAX format http://www.biopax.org/
Pathway Tools
• libSBML http://sbml.org/Software/libSBML
• CellDesigner http://www.celldesigner.org/
• Cytoscape http://www.cytoscape.org/
<?xml version="1.0" encoding="UTF-8"?>
<sbml level="2" version="3" xmlns="http://www.sbml.org/sbml/level2/version3">
  ...
  <listOfSpecies>
    <species compartment="cytosol" id="ES" />
    <species compartment="cytosol" id="P" />
    <species compartment="cytosol" id="S" />
    <species compartment="cytosol" id="E" />
  </listOfSpecies>
  <listOfReactions>
    <reaction id="veq">
      <listOfReactants>
        <speciesReference species="E"/>
        <speciesReference species="S"/>
      </listOfReactants>
      <listOfProducts>
        <speciesReference species="ES"/>
      </listOfProducts>
      <kineticLaw>
        <math xmlns="http://www.w3.org/1998/Math/MathML">
          <apply>
            <times/>
            <ci>cytosol</ci>
            ...
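Because SBML is plain XML, the species and reactions in a model can be listed with a general-purpose XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a small, complete document modeled on the excerpt above; the `model` and `listOfCompartments` wrappers are assumptions added so the sample is self-contained, and the truncated kinetic law is deliberately left out rather than guessed. For real models, libSBML is the purpose-built option.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Namespace taken from the excerpt (SBML Level 2 Version 3).
NS = {"sbml": "http://www.sbml.org/sbml/level2/version3"}

# Illustrative document; the model id and compartment list are assumptions.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<sbml level="2" version="3" xmlns="http://www.sbml.org/sbml/level2/version3">
  <model id="demo">
    <listOfCompartments>
      <compartment id="cytosol"/>
    </listOfCompartments>
    <listOfSpecies>
      <species compartment="cytosol" id="ES"/>
      <species compartment="cytosol" id="P"/>
      <species compartment="cytosol" id="S"/>
      <species compartment="cytosol" id="E"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="veq">
        <listOfReactants>
          <speciesReference species="E"/>
          <speciesReference species="S"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="ES"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""

Reactions = Dict[str, Tuple[List[str], List[str]]]


def summarize(sbml_text: str) -> Tuple[List[str], Reactions]:
    """Return species ids and, per reaction id, (reactants, products)."""
    root = ET.fromstring(sbml_text.encode("utf-8"))
    species = [s.get("id") for s in root.iterfind(".//sbml:species", NS)]
    reactions: Reactions = {}
    for reaction in root.iterfind(".//sbml:reaction", NS):
        reactants = [ref.get("species") for ref in reaction.iterfind(
            "sbml:listOfReactants/sbml:speciesReference", NS)]
        products = [ref.get("species") for ref in reaction.iterfind(
            "sbml:listOfProducts/sbml:speciesReference", NS)]
        reactions[reaction.get("id")] = (reactants, products)
    return species, reactions


if __name__ == "__main__":
    species_ids, reaction_map = summarize(SAMPLE)
    print("species:", species_ids)
    for name, (reactants, products) in reaction_map.items():
        print(name, ":", " + ".join(reactants), "->", " + ".join(products))
```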
KEGG: Pathways in Cancer
NetPath: EGFR1 pathway
Exercises
1. What databases are these identifiers from?
   a. 3KYL
   b. EZH2
   c. Q15910
   d. GO:0008017
   e. GI:8017
   f. A9145C
2. Try finding the corresponding entries online
Exercise Answers
1. What databases are these identifiers from?
   a. 3KYL -> PDB (a protein-RNA structure for telomerase reverse transcriptase, catalytic region)
   b. EZH2 -> HGNC (a human gene for a histone lysine methyl transferase)
   c. Q15910 -> UniProt (a protein sequence for EZH2)
   d. GO:0008017 -> AmiGO (microtubule binding gene ontology)
   e. GI:8017 -> GenBank (a DNA sequence from D. melanogaster)
   f. A9145C -> this one's a trick: it's a chemical compound; you can look it up in PubChem with CID: 6438632