EMBL-EBI Integration of Sequence and 3D structure Databases


Page 1: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Integration of Sequence and 3D structure Databases

Page 2: Integration of Sequence and 3D structure  Databases
Page 3: Integration of Sequence and 3D structure  Databases

EMBL-Bank: DNA sequences

EnsEMBL: Human Genome Gene Annotation

UniProt: Protein Sequences

EMSD: Macromolecular Structure Data

Array-Express: Microarray Expression Data

Page 4: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Integration With UniProt

eFamily Project

Future Plans

Page 5: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Integration With UniProt

UniProt (Universal Protein Resource) is the world's most comprehensive catalogue of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

http://www.ebi.ac.uk/uniprot/index.html

Page 6: Integration of Sequence and 3D structure  Databases

EMBL-EBI

MSD / UniProt: Two Different Database Systems

[Diagram: UniProt and MSD, each with its own services, connected by an agreed common mechanism for exchange of information.]

“One of the major benefits of using databases for data storage is for data sharing”

Page 7: Integration of Sequence and 3D structure  Databases

EMBL-EBI

MSD/UniProt Collaboration

Collaboration between MSD (Sameer Velankar, Phil McNeil) and UniProt (Virginie Mittard, Daniel Barrell) groups

Depends upon:

Clean UniProt (UNP) cross references in the DBREF records for each chain (where possible)

Clean taxonomy IDs for each PDB chain

Taxonomy for PDB Source and UniProt OS must be the same

Page 8: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Cleanup of the DBREF records in the PDB entries

Cleanup of the UniProt cross references in PDB entries

Cleanup of Source Information: NCBI Taxonomy IDs

Cleanup of the Reference information

Update UniProt entries: Source, Reference, Secondary structure information

Supply Additional Information: revision date, experimental method, resolution, R-factor

Residue-by-residue mapping between MSD and UniProt enables chimaeras to be handled correctly

Page 9: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Sequence Schema

Page 10: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Residue by Residue Mapping to UniProt

PDB   CHAIN  UNP     SERIAL  PDB_RES  PDB_SEQ  UNP_RES  UNP_SEQ  ANNOTATION
1HG1  A      P06608  1       -        ALA      22       A        NOT OBSERVED
1HG1  A      P06608  2       -        ASP      23       D        NOT OBSERVED
1HG1  A      P06608  3       -        LYS      24       K        NOT OBSERVED
1HG1  A      P06608  4       4        LEU      25       L
1HG1  A      P06608  5       5        PRO      26       P
1HG1  A      P06608  6       6        ASN      27       N
1HG1  A      P06608  7       7        ILE      28       I
1HG1  A      P06608  8       8        VAL      29       V
1HG1  A      P06608  9       9        ILE      30       I
1HG1  A      P06608  10      10       LEU      31       L
1HG1  A      P06608  11      11       ALA      32       A
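
A mapping table like the one above lends itself to simple lookups. The following is a minimal sketch rather than MSD's actual schema or code: the `ResidueMapping` class, its field names and the two helper functions are hypothetical, and it assumes that the second number in each row is the residue number from the ATOM records (absent when the residue is not observed). The rows are copied from the 1HG1 excerpt.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ResidueMapping:
    """One row of the residue-by-residue mapping (field names are illustrative)."""
    pdb: str
    chain: str
    unp: str                   # UniProt accession
    serial: int                # running position in the chain sequence
    pdb_number: Optional[int]  # assumed: ATOM residue number, None if not observed
    pdb_res: str               # three-letter residue name
    unp_number: int            # position in the UniProt sequence
    unp_res: str               # one-letter residue code


def _row(serial, pdb_number, pdb_res, unp_number, unp_res):
    return ResidueMapping("1HG1", "A", "P06608", serial, pdb_number,
                          pdb_res, unp_number, unp_res)


# Rows taken from the slide excerpt for 1HG1 chain A.
ROWS: List[ResidueMapping] = [
    _row(1, None, "ALA", 22, "A"),
    _row(2, None, "ASP", 23, "D"),
    _row(3, None, "LYS", 24, "K"),
    _row(4, 4, "LEU", 25, "L"),
    _row(5, 5, "PRO", 26, "P"),
    _row(6, 6, "ASN", 27, "N"),
    _row(7, 7, "ILE", 28, "I"),
    _row(8, 8, "VAL", 29, "V"),
    _row(9, 9, "ILE", 30, "I"),
    _row(10, 10, "LEU", 31, "L"),
    _row(11, 11, "ALA", 32, "A"),
]


def unp_position(rows, pdb, chain, pdb_number) -> Optional[int]:
    """Return the UniProt position for an observed PDB residue, or None."""
    for row in rows:
        if (row.pdb, row.chain, row.pdb_number) == (pdb, chain, pdb_number):
            return row.unp_number
    return None


def unobserved(rows) -> List[int]:
    """Return the serial numbers of residues with no ATOM residue number."""
    return [row.serial for row in rows if row.pdb_number is None]
```

With these rows, `unp_position(ROWS, "1HG1", "A", 4)` returns 25 and `unobserved(ROWS)` returns `[1, 2, 3]`.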

Page 11: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Display of Mappings

Page 12: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Page 13: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Page 14: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Integration With IntEnz

IntEnz is the name for the Integrated relational Enzyme database and is the most up-to-date version of the Enzyme Nomenclature. The IntEnz relational database implemented and supported by the EBI is the master copy of the Enzyme Nomenclature data. MSD uses the UniProt accession code(s) mapped to each chain to link to the IntEnz EC number. This is done directly via the MSD and IntEnz Oracle relational databases.

http://www.ebi.ac.uk/intenz/index.html

Page 15: Integration of Sequence and 3D structure  Databases

EMBL-EBI

eFamily

http://www.efamily.org.uk/

The eFamily project is designed to integrate the information contained in five of the major protein databases.

Page 16: Integration of Sequence and 3D structure  Databases

EMBL-EBI

eFamily Core Activities

To integrate the information contained in the five major protein databases.

The member databases (CATH, SCOP, MSD, InterPro, and Pfam) contain information describing protein domains.

For SCOP, CATH and MSD the data is primarily concerned with 3D structures. In InterPro and Pfam the focus is mainly on the sequences.

It is often difficult for biologists to navigate from protein sequence to protein structure and back again.

eFamily aims to provide the scientific community with a coherent and rich view of protein families that allows users to navigate seamlessly between the worlds of protein structure and protein sequence, through improved data resources and integration via grid technologies.

Page 17: Integration of Sequence and 3D structure  Databases

EMBL-EBI

DATA INTEGRATION

[Diagram of the links between UniProt, InterPro, PROSITE, SCOP, Pfam, CATH, GO and MSD. Link labels: curated; common domains definition; HMM prediction; mapping & curation; mapping per residue; mapping start-end; MSD mapping residues/sequence.]

Page 18: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Complexity of Mappings

InterPro - UniProt(s)

UniProt - PDB CHAIN(S)

CATH/SCOP DOMAIN - PDB CHAIN(S)

InterPro - CATH/SCOP

CATH/SCOP DOMAIN - UniProt

An InterPro entry is a collection of one or more UniProt entries

Unlike in the PDB, the concept of a CHAIN does not exist in UniProt

A UniProt entry is always numbered from 1 to N

PDB SEQRES residue numbering is from 1 to N

PDB CHAIN (ATOM records) residue numbering is not necessarily 1 to N

UniProt to PDB mapping can be one to many

PDB CHAIN to UniProt mapping can be one to many
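
Because the mapping is one to many in both directions, a single lookup table is not enough. Here is a small sketch of indexing the links both ways; the `index_links` function is hypothetical and the identifiers in `LINKS` are invented placeholders, not real UniProt or PDB entries.

```python
from collections import defaultdict

# Invented (accession, pdb_id, chain) links for illustration only.
LINKS = [
    ("UNP_1", "PDB1", "A"),
    ("UNP_1", "PDB1", "B"),  # one UniProt entry seen in two chains
    ("UNP_1", "PDB2", "A"),  # ... and in a second PDB entry
    ("UNP_2", "PDB3", "A"),  # chimaeric chain: two UniProt entries
    ("UNP_3", "PDB3", "A"),
]


def index_links(links):
    """Build both directions of the many-to-many relationship."""
    unp_to_chains = defaultdict(list)
    chain_to_unp = defaultdict(list)
    for accession, pdb_id, chain in links:
        unp_to_chains[accession].append((pdb_id, chain))
        chain_to_unp[(pdb_id, chain)].append(accession)
    return dict(unp_to_chains), dict(chain_to_unp)


unp_to_chains, chain_to_unp = index_links(LINKS)
```

Here `unp_to_chains["UNP_1"]` lists three chains, while `chain_to_unp[("PDB3", "A")]` lists two accessions, which is the chimaera case that a residue-level mapping has to resolve.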

Page 19: Integration of Sequence and 3D structure  Databases

EMBL-EBI

MSD-SCOP Mapping for 1cbw

[Table columns: SCOP Domain, PDB Residue Range, Chains, Swiss-Prot Residue Range]

Page 20: Integration of Sequence and 3D structure  Databases

EMBL-EBI

MSD-CATH Mapping for 1cbw

[Table columns: CATH Domains, PDB Residue Range, Chains, Swiss-Prot Residue Range]

Page 21: Integration of Sequence and 3D structure  Databases

EMBL-EBI

MSD-Pfam Mapping for 1cbw

[Table columns: Pfam Domain, PDB Residue Range, Chains, Swiss-Prot Residue Range]

Page 22: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Practical Applications of Database Integration

Page 23: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Mappings Used in Pfam

Pfam now uses the UniProt-to-structure mapping from the MSD Search Database

Saves duplication of effort and weeks of compute

Uses the mapping for annotation of alignments

Pfam domains highlighted on structure of RuBisCO (8ruc)

Page 24: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Mappings Used in InterPro

Page 25: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Mappings Used in SCOP

Page 26: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Comparison of SCOP, CATH and Pfam Domains

SCOP, CATH and Pfam have developed web services for describing their particular domain families. These services can be queried with a protein identifier, protein accession or PDB identifier.

The databases use the MSD/UniProt mapping to translate between the sequence and structure domains.
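
Translating a sequence domain onto a structure amounts to pushing a UniProt residue range through the residue-by-residue mapping. The sketch below is illustrative only: `sequence_range_to_structure` is a hypothetical helper, the mapping values come from the 1HG1 chain A excerpt (assuming the second number in each row is the ATOM residue number), and the range 24-28 is an arbitrary example, not a real domain boundary.

```python
# UniProt position -> PDB residue number for 1HG1 chain A (P06608).
# None marks a residue that is not observed in the structure.
UNP_TO_PDB = {
    22: None, 23: None, 24: None,
    25: 4, 26: 5, 27: 6, 28: 7,
    29: 8, 30: 9, 31: 10, 32: 11,
}


def sequence_range_to_structure(unp_to_pdb, start, end):
    """Return PDB residue numbers for UniProt positions start..end inclusive.

    Positions that are unmapped or not observed are skipped.
    """
    residues = []
    for position in range(start, end + 1):
        pdb_number = unp_to_pdb.get(position)
        if pdb_number is not None:
            residues.append(pdb_number)
    return residues
```

For the arbitrary range 24-28, the call `sequence_range_to_structure(UNP_TO_PDB, 24, 28)` returns `[4, 5, 6, 7]`: position 24 is dropped because it is not observed.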

Page 27: Integration of Sequence and 3D structure  Databases

EMBL-EBI

XML & Web Services

The eFamily project has developed an XML schema to describe:

Domains, Annotation, Sequence Alignments, Structure Alignments

This will be used to provide web services as part of the eFamily project.

More information about the XML schema is available at http://www.efamily.org.uk/xml/efamily/documentation/efamily.shtml

We are also developing a Perl-based API for the eFamily XML, which will be available from the eFamily site as well as via BioPerl.

The MSD residue-by-residue mapping is made available in XML format based on the eFamily schema.

Page 28: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Mapping Annotation

Page 29: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Future Plans

Page 30: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Integration of the IntAct database: IntAct provides a freely available, open source database system and analysis tools for protein interaction data.

http://www.ebi.ac.uk/intact/index.jsp

Page 31: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Residue Mapping Program 1

Makes use of cleaned-up cross-reference & taxonomy data, SEQRES and ATOM/HETATM records from the PDB and the sequence from the UniProt entry to align and map each residue.

Makes connected segments from the PDB ATOM/HETATM records for each chain

These are then aligned against the SEQRES records and all the alignments for the segments are merged to get the SEQRES-ATOM alignment

This enables any unobserved residues to be considered
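
The segment-and-align step can be sketched as follows. This is a deliberately simplified illustration, not the MSD program: it assumes that a "connected segment" is a run of consecutive residue numbers, and it places each segment in SEQRES by exact substring search rather than by a real sequence alignment. The function names are hypothetical; the sequence `ADKLPNIVILA` is the 1HG1 excerpt in one-letter code.

```python
def connected_segments(atom_residues):
    """Split (number, code) pairs into runs of consecutive residue numbers."""
    segments = []
    for number, code in atom_residues:
        if segments and segments[-1][-1][0] + 1 == number:
            segments[-1].append((number, code))
        else:
            segments.append([(number, code)])
    return segments


def align_to_seqres(seqres, atom_residues):
    """Map each 1-based SEQRES position to an ATOM residue number.

    Positions with no matching ATOM residue map to None (unobserved).
    Segments are placed left to right by exact substring search, which
    stands in for the alignment step of a real implementation.
    """
    mapping = {i: None for i in range(1, len(seqres) + 1)}
    search_from = 0
    for segment in connected_segments(atom_residues):
        text = "".join(code for _, code in segment)
        start = seqres.find(text, search_from)
        if start < 0:
            raise ValueError(f"segment {text!r} not found in SEQRES")
        for offset, (number, _) in enumerate(segment):
            mapping[start + offset + 1] = number
        search_from = start + len(text)
    return mapping


SEQRES = "ADKLPNIVILA"
# Residues 4-11 observed, with residue 7 removed to create a hypothetical gap.
ATOMS = [(4, "L"), (5, "P"), (6, "N"), (8, "V"), (9, "I"), (10, "L"), (11, "A")]
SEQRES_TO_ATOM = align_to_seqres(SEQRES, ATOMS)
```

In this example the two segments (4-6 and 8-11) are merged into one SEQRES-ATOM mapping in which positions 1, 2, 3 and 7 are `None`, so the unobserved residues stay visible instead of being lost.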

Page 32: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Residue Mapping Program 2

A similar operation is performed on the UniProt sequence and connected segments from the ATOM/HETATM records to get the UNP-ATOM alignment

The SEQRES-ATOM and UNP-ATOM alignments are then merged to get the final alignment

This is repeated for each chain in the PDB archive (with a UNP cross-reference)

The mapping is loaded into the MSD relational database and validated
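
One way to picture the merge is as a join of the two alignments on the ATOM residue number they share. The sketch below makes that assumption explicit; the slides do not describe how the real program places unobserved residues, so this hypothetical `merge_alignments` function links observed residues only. The values reproduce the 1HG1 chain A excerpt.

```python
def merge_alignments(seqres_to_atom, unp_to_atom):
    """Join two alignments on the ATOM residue number.

    Returns {seqres_position: unp_position} for residues that are
    observed in both alignments; unobserved residues are left out.
    """
    atom_to_unp = {
        atom: unp for unp, atom in unp_to_atom.items() if atom is not None
    }
    merged = {}
    for serial, atom in seqres_to_atom.items():
        if atom is not None and atom in atom_to_unp:
            merged[serial] = atom_to_unp[atom]
    return merged


# 1HG1 chain A excerpt: SEQRES positions 1-11, ATOM residues 4-11 observed.
SEQRES_TO_ATOM = {s: (s if s >= 4 else None) for s in range(1, 12)}
# UniProt P06608 positions 22-32, of which 25-32 match ATOM residues 4-11.
UNP_TO_ATOM = {u: (u - 21 if u >= 25 else None) for u in range(22, 33)}

MERGED = merge_alignments(SEQRES_TO_ATOM, UNP_TO_ATOM)
```

`MERGED` links SEQRES positions 4 to 11 with UniProt positions 25 to 32, matching the observed rows of the 1HG1 table.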

Page 33: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Integrating data from MSD into CATH

Protocols have been developed for regular imports of a subset of the MSD data warehouse into a local CATH database set up in Oracle 9i

For example, information on the biological unit and on protein-ligand interactions will be integrated to increase functional annotations for CATH domain families

Page 34: Integration of Sequence and 3D structure  Databases

EMBL-EBI

MSD & CATH Data Exchange

Two-step process of data synchronisation

Data are moved from the MSD search database to the CATH-UCL site using a combination of Oracle Export/Import and SQL*Loader utilities

Subsequent updates in the MSD database are pushed to the CATH site using an incremental replication mechanism.

Data from the CATH site are pushed to the MSD site, using the same two-step process

The two databases are synchronised
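
The two-step pattern (a bulk copy followed by incremental updates) can be illustrated without any database at all. The sketch below uses plain dictionaries in place of Oracle tables; `full_export`, `apply_increment` and the change-log format are hypothetical stand-ins and do not reproduce Oracle Export/Import, SQL*Loader or Oracle replication.

```python
def full_export(source):
    """Step 1: bulk copy of every row (stands in for the initial load)."""
    return dict(source)


def apply_increment(target, changes):
    """Step 2: apply an ordered change log of (operation, key, value) tuples."""
    for operation, key, value in changes:
        if operation == "upsert":
            target[key] = value
        elif operation == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return target


# Placeholder rows; keys and values are invented for illustration.
source_db = {"row1": "a", "row2": "b"}
replica = full_export(source_db)

# Later changes at the source are recorded once and replayed on the replica.
change_log = [("upsert", "row3", "c"), ("delete", "row1", None)]
apply_increment(source_db, change_log)
apply_increment(replica, change_log)
```

After the change log has been replayed, `replica == source_db` holds. Running the same two steps in the opposite direction gives the two-way exchange described on the slide.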

Page 35: Integration of Sequence and 3D structure  Databases

EMBL-EBI

Structure: SCOP, CATH

Sequence: UniProt (née Swiss-Prot/TrEMBL/PIR), InterPro, GO, Pfam

Function: IntEnz

Literature: Medline