Building an RDF representation of the ChEMBL Database

RDF Workshop - BioMedBridges

Mark Davies, ChEMBL Group, Technical Lead

30/04/2014

Overview

•  Brief introduction to the ChEMBL database

•  Approaches to mapping relational data to RDF data model based on ChEMBL experience

•  New features in ChEMBL RDF (version 18)

•  Future ChEMBL plans

What is ChEMBL?

•  Open access database for drug discovery

•  Freely available (searchable and downloadable)

•  Content:

    •  Bioactivity data manually extracted from the primary medicinal chemistry literature, from journals such as J. Med. Chem.

    •  Subset of data from PubChem

    •  Deposited data, e.g. neglected disease screening, GSK kinase set

•  Bioactivity data is associated with a biological target and a chemical structure

•  Compounds are stored in a structure searchable format

•  Protein targets are linked to protein sequences in UniProt

•  Updated regularly with new data

•  Secure searching (https://www.ebi.ac.uk/chembldb)

ChEMBL

https://www.ebi.ac.uk/chembl/

•  ChEMBL 18 Release

•  1,359,508 compounds

•  12,419,715 activities

•  1,042,374 assays

•  9,414 targets

•  53,298 documents

•  19 bioactivity sources

•  6 compound-only sources

What does ChEMBL data look like?

[Figure: Compound, Target, Activity, Assay, Ref]

How can I access ChEMBL data?

•  Website

•  Web Services

•  Widgets

•  Downloads

•  Virtual Machine (myChEMBL)

•  Semantic Web

ChEMBL + Semantic Web

•  The creation of the RDF version of ChEMBL is funded by the Open PHACTS project - http://www.openphacts.org/

•  Migrate the ChEMBL relational data model to RDF based data model – ‘triplify’ everything

•  RDF generation is part of official ChEMBL release process

•  Identify and use ontologies important in the field of bioactivity data

•  ChEMBL RDF to be made available through EBI RDF Platform

Goals of the ChEMBL RDF conversion

•  Responding to the demands of the community

•  Academic and more recently industry

•  Semantic data conversion and querying

•  Reasoning/inferencing - providing a starting point for the community

•  Ensure the conversion is part of the ChEMBL release cycle

•  ChEMBL data model is still evolving so almost impossible for external efforts to keep up to speed with changes

ChEMBL RDF

[Figure: Compound, Bioactivity, Assay, Target, Ref]

ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/

Conversion process

[Figure: ChEMBL Relational Schema and ChEMBL RDF Schema]

Migrating a relational data model

•  Two approaches were used to convert the ChEMBL relational model to an RDF based model

•  Approach 1: Semi-automated using the D2RQ Platform

•  Approach 2: Manual model building

Where to start?

What tools to use?

What is your goal?

Who is the audience?

Will it be made available to the public?

How often will it be updated?

Which ontologies to use?

Do you write your own ontology?

Which format to use?

Approach 1: Semi-automated using the D2RQ Platform

D2RQ platform overview

•  Query a non-RDF database using SPARQL

•  Access the content of the database as Linked Data over the Web

•  Create custom dumps of the database in RDF formats for loading into an RDF store

•  Access information in a non-RDF database using the Apache Jena API

http://d2rq.org/

ChEMBL relational schema

•  3 core domains

    •  Compound

    •  Activity

    •  Target

•  52 tables (52 primary keys ☺)

•  341 columns

•  4 data types (40 if length, scale and precision included)

•  Many indexes, constraints, triggers, ...

D2RQ: Prerequisites

•  Download the software from http://d2rq.org/

•  Java 1.5 or higher

•  Oracle users will need to download the Oracle JDBC database driver separately

D2RQ: Example usage

•  Aims

    •  Create RDF representation of your relational data model

    •  Run and test SPARQL queries against your database

    •  Online data access and representation

•  Process

    •  Generate “D2R Mapping File”

    •  Start up D2R Server using “D2R Mapping File”

    •  Refine model

•  Supported databases

    •  Oracle, SQL Server, PostgreSQL, MySQL, HSQLDB, …

http://d2rq.org/

D2RQ: Mapping File Creation

•  Command used to generate D2R Mapping File (you just need a database connection string):

$>./generate-mapping -o example_d2r_mapping.ttl \
    -u <USER> \
    -p <PASSWORD> \
    -d oracle.jdbc.driver.OracleDriver jdbc:oracle:thin:@<SERVER>:<PORT>:<DATABASE>

•  The command above inspects the database and creates RDF based definitions of the tables and the relationships between tables

•  Schemas, tables and columns can be skipped or restricted with additional arguments – useful for Oracle

D2RQ: Mapping file example

@prefix map: <#> .
@prefix db: <> .
@prefix vocab: <vocab/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
@prefix jdbc: <http://d2rq.org/terms/jdbc/> .

map:database a d2rq:Database;
    d2rq:jdbcDriver "oracle.jdbc.driver.OracleDriver";
    d2rq:jdbcDSN "jdbc:oracle:thin:@<SERVER>:<PORT>:<DATABASE>";
    d2rq:username "<USER>";
    d2rq:password "<PASSWORD>";
    .

# Table CHEMBL_18.ACTION_TYPE
map:CHEMBL_18_ACTION_TYPE a d2rq:ClassMap;
    d2rq:dataStorage map:database;
    d2rq:uriPattern "CHEMBL_18/ACTION_TYPE/@@CHEMBL_18.ACTION_TYPE.ACTION_TYPE|urlify@@";
    d2rq:class vocab:CHEMBL_18_ACTION_TYPE;
    d2rq:classDefinitionLabel "CHEMBL_18.ACTION_TYPE";
    .
map:CHEMBL_18_ACTION_TYPE__label a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:CHEMBL_18_ACTION_TYPE;
    d2rq:property rdfs:label;
    d2rq:pattern "ACTION_TYPE #@@CHEMBL_18.ACTION_TYPE.ACTION_TYPE@@";
    .
map:CHEMBL_18_ACTION_TYPE_ACTION_TYPE a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:CHEMBL_18_ACTION_TYPE;
    d2rq:property vocab:CHEMBL_18_ACTION_TYPE_ACTION_TYPE;
    d2rq:propertyDefinitionLabel "ACTION_TYPE ACTION_TYPE";
    d2rq:column "CHEMBL_18.ACTION_TYPE.ACTION_TYPE";
    .
map:CHEMBL_18_ACTION_TYPE_DESCRIPTION a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:CHEMBL_18_ACTION_TYPE;
    d2rq:property vocab:CHEMBL_18_ACTION_TYPE_DESCRIPTION;
    d2rq:propertyDefinitionLabel "ACTION_TYPE DESCRIPTION";
    d2rq:column "CHEMBL_18.ACTION_TYPE.DESCRIPTION";
    .
map:CHEMBL_18_ACTION_TYPE_PARENT_TYPE a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:CHEMBL_18_ACTION_TYPE;
    d2rq:property vocab:CHEMBL_18_ACTION_TYPE_PARENT_TYPE;
    d2rq:propertyDefinitionLabel "ACTION_TYPE PARENT_TYPE";
    d2rq:column "CHEMBL_18.ACTION_TYPE.PARENT_TYPE";
    .
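The d2rq:uriPattern and d2rq:pattern values in the mapping file are templates: each @@SCHEMA.TABLE.COLUMN@@ placeholder is filled with the column value of the current row, and the |urlify suffix makes the value safe to use inside a URI. The Python sketch below imitates that expansion outside D2RQ, purely as an illustration. The urlify function is an approximation (spaces become underscores, everything else is percent-encoded), and the example row is hypothetical rather than taken from the database.

```python
import re
from urllib.parse import quote

# Matches @@SCHEMA.TABLE.COLUMN@@ with an optional |function suffix.
PLACEHOLDER = re.compile(r"@@([A-Za-z0-9_.]+)(?:\|([a-z]+))?@@")


def urlify(value):
    """Approximate D2RQ's urlify: spaces to underscores, then percent-encode."""
    return quote(value.replace(" ", "_"), safe="")


def expand_pattern(pattern, row):
    """Fill every @@column@@ placeholder in `pattern` with the value from `row`."""
    def substitute(match):
        column, function = match.group(1), match.group(2)
        value = str(row[column])
        return urlify(value) if function == "urlify" else value

    return PLACEHOLDER.sub(substitute, pattern)


# Hypothetical row, keyed by fully qualified column name (an assumption of this sketch).
example_row = {"CHEMBL_18.ACTION_TYPE.ACTION_TYPE": "POSITIVE MODULATOR"}

uri = expand_pattern(
    "CHEMBL_18/ACTION_TYPE/@@CHEMBL_18.ACTION_TYPE.ACTION_TYPE|urlify@@",
    example_row,
)
label = expand_pattern(
    "ACTION_TYPE #@@CHEMBL_18.ACTION_TYPE.ACTION_TYPE@@",
    example_row,
)
print(uri)
print(label)
```

The relative URI produced by the first pattern is resolved by D2R Server against its base URI, which is why the generated mapping file never mentions a host name.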

D2R Server

•  Command used to start D2R server (assuming you have generated mapping file):

$>./d2r-server example_d2r_mapping.ttl

•  SPARQL endpoint and explorer

•  Browsing database contents

•  Resolvable URIs

•  Content negotiation

•  Downloading contents of BLOBs/CLOBs

•  Serving the vocabulary

•  Publishing metadata

http://d2rq.org/

(Quick) D2RQ Demo

D2RQ: Data modeling

•  It is possible to model data and also create more meaningful class and property names

•  Approach 1

•  Edit the mapping file using advanced features of the D2RQ mapping language: http://d2rq.org/d2rq-language

•  Approach 2

•  Create mapping file based on restricted set of database objects e.g. users, schemas, views, materialised views – modeling within the database

D2RQ: Optimisations

•  Review D2R server deployment

•  Increase D2R server heap space

•  Review configuration settings, such as page sizes, resultset limits

•  Use latest built-in D2RQ optimisations, by specifying d2rq:useAllOptimizations (or --fast flag on server startup)

•  Use D2RQ’s dump-rdf command to export RDF representation of database

    •  Exported RDF can then be imported into a triplestore, e.g. Virtuoso

D2RQ: Limitations

•  General limitations

•  Integration of multiple databases not possible – achieved within the database

•  Read only access – update extension is available

•  Limited inferencing available

•  Named graphs not supported

•  Users tend to end up creating ‘weird’ mapping files

•  Database models are often not perfect or clean, which complicates mapping file creation process

Just mapping to RDF is not enough

Approach 2: Manual model building

Approach 2 outline

•  Build a basic ontology and instantiate it with data

•  Steps:

•  Define entities in data source

•  Define relationships between entities

•  Define properties of the entities

•  Identify and use external ontologies

Model building considerations

•  Review available technologies/languages

    •  Your preferred language may not have good RDF support/libraries

•  RDF data formats, e.g. RDF/XML (.rdf), Turtle (.ttl), N3 (.n3)

    •  All are interchangeable, but some are considered more readable and offer reduced size

ChEMBL relational schema revisited

[Schema diagram, grouped into: Molecules, Targets, References, Activities, Assays, Binding Sites, MOAs, Drugs]

ChEMBL Entities/Classes

ChEMBL 17 classes

[Diagram: Substance, Activity, Assay, Document, Target, Target-Component, Source, Journal, Protein-Classification, Bio-Component]

ChEMBL class definition

•  An OWL based ontology used to define ChEMBL classes

•  OWL snippet used to define ChEMBL assay:

@prefix : <http://rdf.ebi.ac.uk/terms/chembl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:Assay
    rdf:type owl:Class ;
    rdfs:label "ChEMBL Assay Class"^^xsd:string ;
    rdfs:subClassOf :ChEMBL .

•  Tools available to help build and write ontologies, e.g. Protégé and TopBraid Composer

Entity RDF representation

•  Entity names become classes, which allow you to type your data

•  ‘chembl_assay:’ and ‘cco:’ are prefixes and ‘a’ is a Turtle shorthand for rdf:type

chembl_assay:CHEMBL615672 a cco:Assay .

ChEMBL Entity relationships

[Diagram: Substance, Activity, Assay, Document, Target, Target-Component, Source, Journal, Protein-Classification, Bio-Component]

Relationship RDF representation

•  Relationships defined between instances of your entities are object properties

chembl_assay:CHEMBL615672 a cco:Assay ;
    cco:hasTarget chembl_target:CHEMBL612910 ;
    cco:hasActivity chembl_activity:CHEMBL_ACT_227195 .

ChEMBL assay attributes

•  Identify attributes from database you want to include in RDF model

•  Map attribute types e.g. integers, strings, Booleans

•  Some attributes map to external resources/ontologies – see later

•  Denormalisation of relational data, e.g. FKs

Attribute RDF representation

•  Attributes you define for your classes are datatype properties

•  Good practice to add an rdfs:label to all instances

chembl_assay:CHEMBL615672 a cco:Assay ;
    cco:hasTarget chembl_target:CHEMBL612910 ;
    cco:hasActivity chembl_activity:CHEMBL_ACT_227195 ;
    rdfs:label "CHEMBL615672" ;
    cco:chemblId "CHEMBL615672" ;
    cco:assayType "Functional" ;
    cco:assayCellType "3LL cell line" ;
    cco:organismName "Mus musculus" .
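The three steps so far (type the entity, add relationships as object properties, add attributes as datatype properties) can be sketched as a small Python function that turns one relational record into a Turtle statement. This is only an illustration of the idea, not the ChEMBL release code: the dictionary keys (chembl_id, target_id, activity_ids, assay_type, cell_type, organism) are hypothetical column names, and the prefix declarations are left out.

```python
# Hypothetical mapping from relational column names to CCO datatype properties.
ATTRIBUTE_PROPERTIES = {
    "assay_type": "cco:assayType",
    "cell_type": "cco:assayCellType",
    "organism": "cco:organismName",
}


def turtle_literal(value):
    """Render a Python value as a Turtle literal (strings are quoted and escaped)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def assay_to_turtle(row):
    """Serialise one assay record as a single Turtle statement."""
    subject = f"chembl_assay:{row['chembl_id']}"
    # Step 1: the entity name becomes the class.
    pairs = [("a", "cco:Assay")]
    # Step 2: foreign keys become object properties pointing at other resources.
    pairs.append(("cco:hasTarget", f"chembl_target:{row['target_id']}"))
    for activity_id in row["activity_ids"]:
        pairs.append(("cco:hasActivity", f"chembl_activity:{activity_id}"))
    # Step 3: plain columns become datatype properties holding literals.
    pairs.append(("rdfs:label", turtle_literal(row["chembl_id"])))
    pairs.append(("cco:chemblId", turtle_literal(row["chembl_id"])))
    for column, prop in ATTRIBUTE_PROPERTIES.items():
        if row.get(column) is not None:
            pairs.append((prop, turtle_literal(row[column])))
    body = " ;\n    ".join(f"{p} {o}" for p, o in pairs)
    return f"{subject} {body} .\n"


row = {
    "chembl_id": "CHEMBL615672",
    "target_id": "CHEMBL612910",
    "activity_ids": ["CHEMBL_ACT_227195"],
    "assay_type": "Functional",
    "cell_type": "3LL cell line",
    "organism": "Mus musculus",
}
turtle = assay_to_turtle(row)
print(turtle)
```

Writing Turtle by string formatting keeps the sketch dependency-free; an RDF library would normally take care of escaping and prefix handling.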

More examples: ChEMBL Entity properties

[Screenshots from TopBraid Composer (Free Edition): Substance, Target, Activity]

Mapping to external ontologies

•  Examples of ontologies/taxonomies mapped to in ChEMBL RDF include:

•  BioAssay Ontology (BAO)

•  ChEBI

•  Chemical Information Ontology (CHEMINF)

•  Bibliographic Ontology

•  Unit Ontology (UO)

•  QUDT Ontology

•  Semanticscience Integrated Ontology (SIO)

•  Cell Line Ontology (CLO)

•  Experimental Factor Ontology (EFO)

External ontologies/taxonomies

•  Identification of relevant external ontologies

•  Community consensus + recommendations

•  BioPortal - https://bioportal.bioontology.org/

•  Ontology Lookup Service - https://www.ebi.ac.uk/ontology-lookup/

ChEMBL assay data

[Diagram: Substance, Activity, Assay, Document, Target, Target-Component, Source, Journal, Protein-Classification, Bio-Component]

ChEMBL assay annotation

•  The assay is the central component of the ChEMBL data model

•  Current model not ideal

•  Single category - binding, functional, ADMET, physchem

•  Unstructured/free text used to describe assay

•  Many assay parameters not captured – although often not available

•  Ontologies are now being used to improve ChEMBL assay annotations - ChEMBL_17 onwards

•  Mappings to BAO bioassays, assay_format, endpoints

•  http://bioassayontology.org

BioAssay Ontology

Bioassay parent class, 92 descendant classes

How do we map to all these BAO assay classes?

External ontology mapping process

•  In many cases mapping is straightforward

•  Use common bridging identifier e.g. UniProt

•  Simple text based conversion e.g. units - actually units not so straightforward in ChEMBL

•  Some mappings require complex rules e.g. assay details

•  Multiple database parameters

•  Complex text processing

•  Manual curation

•  Tools available to assist with mapping process

•  BioPortal Annotator (http://bioportal.bioontology.org/annotator)

•  Zooma (http://www.ebi.ac.uk/fgpt/zooma/)

•  ChEMBL Assay Annotator

BioPortal Annotator

http://bioportal.bioontology.org/annotator

[Screenshot with callouts: ChEMBL Assay Description; Restricted to ontologies of interest (optional); Results]

API available: http://data.bioontology.org/documentation

BioPortal Annotator Example

•  CHEMBL2213497 assay description

•  More information here:

https://www.ebi.ac.uk/chembl/assay/inspect/CHEMBL2213497

Now use BioPortal Annotator to annotate…

“Induction of apoptosis in human Jurkat T cells overexpressing Neo assessed as loss in mitochondrial membrane potential at 30 ug/ml after 36 hrs by DiO6-based flow cytometry (Rvb = 5.4%)”

ChEMBL Assay Annotator

•  ChEMBL Assay Annotator developed by Samuel Croset

•  Aim is to map ChEMBL assays to BAO assay classes

•  ‘Tailored’ mapping rules developed
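To make the idea of tailored mapping rules concrete, the Python sketch below assigns a BAO assay_format term to an assay record using a few ordered rules. The rules themselves are hypothetical and far simpler than anything the ChEMBL Assay Annotator would need; only the BAO identifiers (BAO_0000205 has_assay_format, BAO_0000219 cell based, BAO_0000218 organism-based) come from these slides, and the record keys cell_type and description are assumed names.

```python
import re

BAO_HAS_ASSAY_FORMAT = "bao:BAO_0000205"
BAO_CELL_BASED = "bao:BAO_0000219"
BAO_ORGANISM_BASED = "bao:BAO_0000218"


def guess_assay_format(assay):
    """Return a BAO assay_format term for an assay record, or None if no rule fires.

    The rules are illustrative assumptions, checked in order of decreasing confidence.
    """
    description = (assay.get("description") or "").lower()
    # Rule 1: a curated cell type is the strongest signal.
    if assay.get("cell_type"):
        return BAO_CELL_BASED
    # Rule 2: free text mentioning whole-organism experiments.
    if "in vivo" in description:
        return BAO_ORGANISM_BASED
    # Rule 3: free text mentioning cells.
    if re.search(r"\bcells?\b", description):
        return BAO_CELL_BASED
    return None


def format_triple(chembl_id, bao_term):
    """Render the assay_format annotation as a Turtle triple."""
    return f"chembl_assay:{chembl_id} {BAO_HAS_ASSAY_FORMAT} {bao_term} ."


jurkat = {
    "description": "Induction of apoptosis in human Jurkat T cells overexpressing Neo "
                   "assessed as loss in mitochondrial membrane potential"
}
term = guess_assay_format(jurkat)
print(format_triple("CHEMBL2213497", term))
```

Assays for which no rule fires return None, so they can be routed to manual curation instead of receiving a guessed annotation.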

External mapping representation

•  In this example defining assay_format:

chembl_assay:CHEMBL615672 a cco:Assay ;
    cco:hasTarget chembl_target:CHEMBL612910 ;
    cco:hasActivity chembl_activity:CHEMBL_ACT_227195 ;
    rdfs:label "CHEMBL615672" ;
    cco:chemblId "CHEMBL615672" ;
    cco:assayType "Functional" ;
    cco:assayCellType "3LL cell line" ;
    cco:organismName "Mus musculus" ;
    bao:BAO_0000205 bao:BAO_0000219 .

BAO_0000205 = has_assay_format
BAO_0000219 = “Cell based”

ChEMBL Core Ontology (CCO)

•  The skeleton schema used to store ChEMBL classes, object properties and datatype properties

•  The file is also RDF, so it can be queried independently of any instances

•  ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/18.0/cco.ttl.gz

•  Namespace: http://rdf.ebi.ac.uk/terms/chembl#

•  Initial focus on Substance (Molecule) and Target Classification

•  In future an additional mapping file may be provided, which maps/aligns ChEMBL classes and properties to external resources

ChEMBL Core Ontology (CCO)

[Figure panels: Classes, Target Classes, Substance Classes]

ChEMBL RDF schema

https://www.ebi.ac.uk/rdf/documentation/chembl

The (raw) end result

Querying ChEMBL data

•  Need to load files into triplestore (Virtuoso Open Source)

[Diagram: ChEMBL Data, External Ontos (e.g. BAO) and CCO loaded into the ChEMBL Triplestore, served through the ChEMBL SPARQL Interface/LD Browser; also shown: Reactome Triplestore, UniProt Triplestore, Bio2RDF]

http://www.ebi.ac.uk/rdf/services/chembl/sparql

Using ‘external’ RDF data sources

•  Questions to think about when using external RDF data sources

•  Who creates the resource and the RDF representation?

•  When was the resource last updated?

•  When was the RDF last updated?

•  Does the data model make sense?

•  Do basic queries work?

•  Shared entities and ontologies?

•  Any data licensing issues?

VoID can help

•  VoID = Vocabulary of Interlinked Datasets

•  Acts as a bridge between publishers and users

•  EBI RDF resources provide a VoID (just an extra RDF file)

•  Information contained in VoID

•  Creation timestamps

•  Publisher details

•  Versioning

•  Ontologies/vocabularies used

•  Licensing

•  Data formats available and where they live (not just RDF)

•  More complex information such as Subsets and Linksets
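As a rough illustration of what such a file contains, the Python sketch below assembles a minimal VoID description as Turtle text. It is not the actual ChEMBL VoID file: the dataset URI and title are hypothetical, the choice of pav:version for the version string is an assumption, and only the triple count, endpoint, download location and namespaces are taken from these slides.

```python
PREFIXES = [
    "@prefix void: <http://rdfs.org/ns/void#> .",
    "@prefix dcterms: <http://purl.org/dc/terms/> .",
    "@prefix pav: <http://purl.org/pav/> .",
]


def void_description(dataset_uri, title, version, triples,
                     sparql_endpoint, data_dumps, vocabularies):
    """Build a minimal VoID dataset description as a Turtle string."""
    statements = [
        "a void:Dataset",
        f'dcterms:title "{title}"',
        f'pav:version "{version}"',
        f"void:triples {triples}",
        f"void:sparqlEndpoint <{sparql_endpoint}>",
    ]
    # One statement per downloadable file and per vocabulary in use.
    statements += [f"void:dataDump <{url}>" for url in data_dumps]
    statements += [f"void:vocabulary <{uri}>" for uri in vocabularies]
    body = " ;\n    ".join(statements)
    return "\n".join(PREFIXES) + f"\n\n<{dataset_uri}> {body} .\n"


void_ttl = void_description(
    dataset_uri="http://example.org/void#chembl_rdf",  # hypothetical URI
    title="ChEMBL RDF",                                # hypothetical title
    version="18.0",
    triples=409989782,
    sparql_endpoint="http://www.ebi.ac.uk/rdf/services/chembl/sparql",
    data_dumps=["ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/18.0/cco.ttl.gz"],
    vocabularies=[
        "http://rdf.ebi.ac.uk/terms/chembl#",
        "http://www.bioassayontology.org/bao#",
    ],
)
print(void_ttl)
```

A real VoID file would also carry the publisher, creation timestamps and licence listed above; they are omitted here because their values are not given in the slides.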

Quick look at the ChEMBL VoID…

Model building recommendations

•  Technology review

•  URIs should resolve and be future proofed

•  Ensure the correct external namespaces are being used

•  Add rdfs:label to everything

•  Consider using identifiers for ontology names instead of textual descriptions

•  As a ‘small’ descriptive ontology grows, consistent naming conventions can break down

•  Not used for CCO, but a future switch of format may be considered, e.g. CCO_000001 = ChEMBL Activity, CCO_000002 = ChEMBL Assay and so on
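One way to picture the identifier-based scheme is a small registry that hands out sequential opaque identifiers and never changes one after it has been assigned. The Python sketch below is a hypothetical illustration of that idea, using the CCO_000001 style shown above; it is not something CCO currently does.

```python
def mint_identifiers(labels, existing=None, prefix="CCO", width=6):
    """Give each label a sequential opaque identifier, keeping any already assigned."""
    assigned = dict(existing or {})
    # Continue numbering after the highest identifier already in use.
    highest = max(
        (int(value.rsplit("_", 1)[1]) for value in assigned.values()),
        default=0,
    )
    for label in labels:
        if label not in assigned:
            highest += 1
            assigned[label] = f"{prefix}_{highest:0{width}d}"
    return assigned


ids = mint_identifiers(["ChEMBL Activity", "ChEMBL Assay"])
# A later release adds a class; identifiers minted earlier stay untouched.
ids_next = mint_identifiers(["ChEMBL Assay", "ChEMBL Target"], existing=ids)
print(ids_next)
```

Because the identifier carries no meaning, a class can be relabelled later without its URI changing, which is the main attraction over descriptive names.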

Technology stack

•  Triple Processing

•  Groovy

•  OpenRDF Sesame Java API (http://www.openrdf.org)

•  Rapper – useful command line utility

•  Triplestore/Storage

•  Virtuoso Open Source Edition 6.1.7 (Upgrade to Version 7 planned)

•  Raw .ttl files available to download from ChEMBL FTP site ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/

•  Domain/class specific .ttl files created – helps processing and loading

New features ChEMBL 18 RDF

•  More data, now 409,989,782 triples

•  New types of data

•  Binding sites, cell lines, mechanism of action

•  New properties

•  Molecule hierarchy mappings

•  Target complex mappings

•  Assay parameters

•  Improved mappings to the BAO ontology assay_format, e.g. biochemical, physicochemical, cell based, …

•  Some example queries now follow ->

Example query 1

•  Get all human cell-lines:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bao: <http://www.bioassayontology.org/bao#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>

SELECT ?cellLine ?cellName
WHERE {
    ?cellLine a cco:CellLine ;
        cco:taxonomy <http://identifiers.org/taxonomy/9606> ;
        rdfs:label ?cellName .
}

http://tinyurl.com/odqulmq
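The example queries can also be sent programmatically. The Python sketch below builds a SPARQL protocol GET request for the human cell-line query and flattens a SPARQL JSON results document into plain rows. It uses only the standard library and does not contact the server: the endpoint URL is the one given in these slides and may have moved since, and the sample response is made-up data in the standard results format.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Endpoint as given in the slides; it may no longer be live.
ENDPOINT = "http://www.ebi.ac.uk/rdf/services/chembl/sparql"

QUERY = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>

SELECT ?cellLine ?cellName
WHERE {
    ?cellLine a cco:CellLine ;
        cco:taxonomy <http://identifiers.org/taxonomy/9606> ;
        rdfs:label ?cellName .
}"""


def build_request(endpoint, query):
    """Build (but do not send) a SPARQL protocol GET request asking for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def rows_from_results(results):
    """Flatten a SPARQL JSON results document into a list of {variable: value} dicts."""
    return [
        {name: binding[name]["value"] for name in binding}
        for binding in results["results"]["bindings"]
    ]


request = build_request(ENDPOINT, QUERY)
# Decode the URL again to confirm the query survives encoding unchanged.
sent_query = parse_qs(urlsplit(request.full_url).query)["query"][0]

# Made-up response in the SPARQL JSON results format, for illustration only.
sample_results = {
    "head": {"vars": ["cellLine", "cellName"]},
    "results": {"bindings": [
        {
            "cellLine": {"type": "uri", "value": "http://example.org/cell_line/1"},
            "cellName": {"type": "literal", "value": "Example cell line"},
        }
    ]},
}
rows = rows_from_results(sample_results)
print(rows)
```

To run the query for real, pass the request to urllib.request.urlopen and decode the response body with json.load before calling rows_from_results.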

Example query 2

•  Get all compounds that have been tested in a cell-based (bao:BAO_0000219) toxicity assay in HepG2 cells:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bao: <http://www.bioassayontology.org/bao#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>

SELECT ?mol ?assayDesc
WHERE {
    ?mol ?p cco:Substance ;
        cco:substanceType ?moltype ;
        cco:hasActivity ?activity .
    ?activity cco:hasAssay ?assay .
    ?assay cco:assayCellType 'HepG2' ;
        cco:assayType 'Toxicity' ;
        bao:BAO_0000205 bao:BAO_0000219 ;
        dcterms:description ?assayDesc .
}

http://tinyurl.com/oyttvlr

Example query 3

•  Get all concentration response assays (bao:BAO_0002162) for monoamine receptor targets (CHEMBL_PC_1266):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bao: <http://www.bioassayontology.org/bao#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>

SELECT distinct ?assay ?assayDesc
WHERE {
    <http://rdf.ebi.ac.uk/resource/chembl/protclass/CHEMBL_PC_1266> cco:hasTargetDescendant ?target .
    ?target cco:hasAssay ?assay .
    ?assay cco:hasActivity ?activity ;
        dcterms:description ?assayDesc .
    ?activity bao:BAO_0000208 ?endpoint .
    ?endpoint rdfs:subClassOf bao:BAO_0002162 .
}

http://tinyurl.com/o6qg8uk

Example query 4

•  Get the number of ADME assays carried out in organism-based (bao:BAO_0000218) format for FDA approved drugs:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bao: <http://www.bioassayontology.org/bao#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>

SELECT ?molname ?mol (count(distinct ?assay) as ?assay_count)
WHERE {
    ?assay a cco:Assay ;
        cco:assayType 'ADME' ;
        bao:BAO_0000205 bao:BAO_0000218 .
    ?assay cco:hasActivity ?activity .
    ?activity cco:hasMolecule ?mol .
    ?mol cco:highestDevelopmentPhase 4 ;
        rdfs:label ?molname .
}
GROUP BY ?molname ?mol
ORDER BY DESC(count(distinct ?assay))

http://tinyurl.com/psu5442

Example query 5

•  Get all cell-lines that have been used in physical property assays (bao:BAO_0002128):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bao: <http://www.bioassayontology.org/bao#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>

SELECT ?cellLine ?assay
WHERE {
    ?cellLine a cco:CellLine ;
        cco:isCellLineForAssay ?assay .
    ?assay cco:hasActivity ?activity .
    ?activity bao:BAO_0000208 ?endpoint .
    ?endpoint rdfs:subClassOf bao:BAO_0002128 .
}

http://tinyurl.com/ojytly3

Example query 6

•  Get all Protein Kinase (CHEMBL_PC_1100) inhibitor binding sites:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bao: <http://www.bioassayontology.org/bao#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>

SELECT ?target ?bindingSite ?siteName ?inhibitor
WHERE{
  ?bindingSite a cco:BindingSite ;
               cco:bindingSiteName ?siteName ;
               cco:hasTarget ?target .
  <http://rdf.ebi.ac.uk/resource/chembl/protclass/CHEMBL_PC_1100> cco:hasTargetDescendant ?target .
  ?target rdfs:label ?targetName ;
          cco:isTargetForMechanism ?mechanism .
  ?mechanism cco:mechanismActionType 'INHIBITOR' ;
             cco:mechanismDescription ?mechanismDesc ;
             cco:hasMolecule ?molecule .
  ?molecule rdfs:label ?inhibitor .
}

http://tinyurl.com/onr2yto
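A SELECT query such as example query 6 returns one row per variable binding, so the same binding site appears once for every inhibitor. The sketch below shows one way to read results in the W3C SPARQL JSON results format and group inhibitors by site name. The sample document is invented purely for illustration (the example.org URIs and names are not ChEMBL data), and the helper names rows and inhibitors_by_site are hypothetical.

```python
import json

# Assumption: hand-written sample in the SPARQL 1.1 JSON results layout.
# The values are made up for illustration and are NOT real ChEMBL results.
SAMPLE = """
{
  "head": {"vars": ["target", "bindingSite", "siteName", "inhibitor"]},
  "results": {"bindings": [
    {"target": {"type": "uri", "value": "http://example.org/target/T1"},
     "bindingSite": {"type": "uri", "value": "http://example.org/site/S1"},
     "siteName": {"type": "literal", "value": "example site"},
     "inhibitor": {"type": "literal", "value": "EXAMPLE DRUG A"}},
    {"target": {"type": "uri", "value": "http://example.org/target/T1"},
     "bindingSite": {"type": "uri", "value": "http://example.org/site/S1"},
     "siteName": {"type": "literal", "value": "example site"},
     "inhibitor": {"type": "literal", "value": "EXAMPLE DRUG B"}}
  ]}
}
"""


def rows(doc: dict) -> list:
    """Flatten each binding to {variable: value}, skipping unbound variables."""
    variables = doc["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in doc["results"]["bindings"]
    ]


def inhibitors_by_site(doc: dict) -> dict:
    """Group the ?inhibitor labels under each ?siteName."""
    grouped = {}
    for row in rows(doc):
        grouped.setdefault(row["siteName"], set()).add(row["inhibitor"])
    return grouped


document = json.loads(SAMPLE)
for site, inhibitors in inhibitors_by_site(document).items():
    print(site, sorted(inhibitors))
```

Grouping on the client side keeps the query itself unchanged; the same effect could be obtained in SPARQL with GROUP BY, as example query 4 does for assay counts.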

Future Plans: SureChEMBL

•  In December 2013, EMBL-EBI acquired SureChem, a leading ‘chemistry patent mining’ product, from Digital Science, Macmillan Group

•  SureChem provides a live (updated daily) view of chemical patent space

•  Rebranded as SureChEMBL

https://www.surechembl.org

Open PHACTS extension

•  The Open PHACTS project is keen to include patent data in future extensions to the project

•  ENSO approved - funding to include SureChEMBL data in Open PHACTS

•  RDF conversion, target indexing and API development

•  EBI-RDF project benefits from the RDF conversion

•  SureChEMBL is updated daily, compared to quarterly ChEMBL updates

•  Interesting challenge for us: creating exports and building systems that load SureChEMBL data

Open PHACTS Platform

[Figure: Open PHACTS Platform architecture diagram. Labelled components: Apps; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Core Platform (Semantic Workflow Engine, Data Cache (Virtuoso Triple Store), Identity Resolution Service, Identifier Management Service, Chemistry Registration Normalisation & Q/C, Indexing); data sources (Public Content, Commercial, Public Ontologies, User Annotations) shown as Db, Nanopub and VoID elements. Example identifiers shown: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”.]

(slide author: Lee Harland)

Summary

•  Review of the ChEMBL database

•  Two approaches to modelling ChEMBL data

•  Approach 2 used to build RDF representation of ChEMBL

•  New features included in ChEMBL_18 release

•  Model enhancements

•  More data

•  Plans for the future

•  Patents

•  Open PHACTS

Acknowledgements

Groups and people involved in the RDF representation of ChEMBL include:

ChEMBL Group

•  Anna Gaulton

•  Samuel Croset

•  John Overington

Open PHACTS

•  Alasdair Gray

•  Antonis Loizou

•  Lee Harland

•  Egon Willighagen

EBI-RDF Group

•  Andy Jenkinson

•  Simon Jupp

•  James Malone