Introduction to RDF and the Semantic Web for the life sciences
Simon Jupp
Sample Phenotypes and Ontologies Team
European Bioinformatics Institute
Day 2 practical session
• Exploring EBI RDF platform
• Querying EBI resources
• Federated queries from one SPARQL endpoint to another
Exercise 17
• Explore the EBI RDF platform at http://www.ebi.ac.uk/rdf
• A) On the ChEMBL endpoint get ChEMBL activities, assays and targets for the drug Clotrimazole (CHEMBL104)
• B) On the Atlas endpoint find expression for ENSDARG00000042641 (Cyp51)
• B2) Filter the results to those whose property type contains “organism_part”
• C) On the Reactome endpoint find pathways that reference Cyp51 (http://purl.uniprot.org/uniprot/Q1JPY5)
• D) Query the UniProt endpoint to describe http://purl.uniprot.org/uniprot/Q1JPY5
Exercise 17 solution A
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX chembl_molecule: <http://rdf.ebi.ac.uk/resource/chembl/molecule/>
SELECT ?activity ?assay ?target ?targetcmpt ?uniprot
WHERE {
  ?activity a cco:Activity ;
            cco:hasMolecule chembl_molecule:CHEMBL104 ;
            cco:hasAssay ?assay .
  ?assay cco:hasTarget ?target .
  ?target cco:hasTargetComponent ?targetcmpt .
  ?targetcmpt cco:targetCmptXref ?uniprot .
  ?uniprot a cco:UniprotRef
}
Exercise 17 solution B
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX identifiers: <http://identifiers.org/ensembl/>
SELECT distinct ?diffValue ?expUri ?propertyType ?propertyValue ?pvalue
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?diffValue .
  ?value atlasterms:hasFactorValue ?factor .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref identifiers:ENSDARG00000042641 .
}
Exercise 17 solution B2
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX identifiers: <http://identifiers.org/ensembl/>
SELECT distinct ?diffValue ?expUri ?propertyType ?propertyValue ?pvalue
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?diffValue .
  ?value atlasterms:hasFactorValue ?factor .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref identifiers:ENSDARG00000042641 .
  FILTER regex(?propertyType, "organism_part")
}
Exercise 17 solution C
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>
SELECT DISTINCT ?pathway ?pathwayname
WHERE {
  ?pathway rdf:type biopax3:Pathway .
  ?pathway biopax3:displayName ?pathwayname .
  ?pathway biopax3:pathwayComponent ?reaction .
  ?reaction rdf:type biopax3:BiochemicalReaction .
  {
    { ?reaction ?rel ?protein . }
    UNION
    { ?reaction ?rel ?complex .
      ?complex rdf:type biopax3:Complex .
      ?complex ?comp ?protein . }
  }
  ?protein rdf:type biopax3:Protein .
  ?protein biopax3:entityReference <http://purl.uniprot.org/uniprot/Q1JPY5>
}
LIMIT 100
Exercise 17 solution D
DESCRIBE <http://purl.uniprot.org/uniprot/Q1JPY5>
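The solutions above can also be sent to an endpoint from a script instead of the web form. A minimal sketch, assuming the endpoint accepts the standard SPARQL Protocol GET form (the query passed as a `query` URL parameter). `build_query_url` is a hypothetical helper name, and the UniProt endpoint URL is the one that appears in the Q3 slide, so it may have changed since. The code only builds the URL and does not contact any server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint URL as it appears in the Q3 slide; an assumption, it may have moved.
UNIPROT_ENDPOINT = "http://beta.sparql.uniprot.org/sparql"


def build_query_url(endpoint, query):
    """Return a SPARQL Protocol GET URL with the query percent-encoded."""
    return endpoint + "?" + urlencode({"query": query})


# Exercise 17 solution D, unchanged.
solution_d = "DESCRIBE <http://purl.uniprot.org/uniprot/Q1JPY5>"
url = build_query_url(UNIPROT_ENDPOINT, solution_d)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
print(decoded == solution_d)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; which result formats the server returns is not covered here.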
Federated querying
• One of the biggest advantages of SPARQL and triple stores is the ability to federate queries across endpoints
• Integrating data at query time with SPARQL
[Figure: UniProt (GO annotation) and GXA (expression value) linked through a shared protein]
Federated SPARQL
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?experiment ?description WHERE {
SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
?experiment a atlasterms:Experiment .
?experiment dcterms:identifier ?id .
?experiment dcterms:description ?description .
FILTER regex(?id, "E-GEOD-2852")
}
}
http://tinyurl.com/o9kvvzn
We can execute this query from any other endpoint using the SPARQL SERVICE keyword
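The same wrapping can be done in code when queries are assembled programmatically. The sketch below is only an illustration: `wrap_in_service` is a hypothetical helper, not part of any library, and it does plain string assembly without validating the SPARQL. It rebuilds the federated query shown above from its Atlas graph pattern.

```python
ATLAS_ENDPOINT = "http://www.ebi.ac.uk/rdf/services/atlas/sparql"


def wrap_in_service(endpoint, pattern):
    """Wrap a graph pattern in a SERVICE block for the given endpoint."""
    lines = [line.strip() for line in pattern.strip().splitlines()]
    body = "\n".join("    " + line for line in lines)
    return "  SERVICE <" + endpoint + "> {\n" + body + "\n  }"


# Graph pattern copied from the slide; it runs on the remote Atlas endpoint.
atlas_pattern = """
?experiment a atlasterms:Experiment .
?experiment dcterms:identifier ?id .
?experiment dcterms:description ?description .
FILTER regex(?id, "E-GEOD-2852")
"""

federated_query = "\n".join([
    "PREFIX dcterms: <http://purl.org/dc/terms/>",
    "PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>",
    "SELECT DISTINCT ?experiment ?description WHERE {",
    wrap_in_service(ATLAS_ENDPOINT, atlas_pattern),
    "}",
])
print(federated_query)
```

Everything inside the SERVICE block is evaluated by the remote endpoint; anything outside it is evaluated by the endpoint that receives the query.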
Exercise 19
• Execute the following federated query on
• 1. The UniProt SPARQL endpoint
• 2. Your Sesame workbench SPARQL endpoint
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?experiment ?description WHERE {
SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
?experiment a atlasterms:Experiment .
?experiment dcterms:identifier ?id .
?experiment dcterms:description ?description .
FILTER regex(?id, "E-GEOD-2852")
}
}
Constructing a Federated query
• Basic query to get genes out of our dataset
• How can we integrate this with data from the EMBL-EBI Gene Expression Atlas?
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
SELECT DISTINCT ?geneid ?label
WHERE {
  ?result mydata:dbXref ?geneid .
  ?geneid rdfs:label ?label .
}
Querying the Atlas
Our query
SELECT DISTINCT ?geneid ?label
WHERE {
  ?result mydata:dbXref ?geneid .
  ?geneid rdfs:label ?label .
}
Example query 3 (http://www.ebi.ac.uk/rdf/services/atlas/sparql)
SELECT distinct ?diffValue ?expUri ?propertyType ?propertyValue ?pvalue
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?diffValue .
  ?value atlasterms:hasFactorValue ?factor .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref identifiers:ENSMUSG00000034450 .
}
ORDER BY ASC (?pvalue)
Integration point: the gene URI (?geneid in our query, identifiers:ENSMUSG00000034450 in the Atlas query)
Querying the Atlas
Example query 3 (http://www.ebi.ac.uk/rdf/services/atlas/sparql)
SELECT distinct ?diffValue ?expUri ?propertyType ?propertyValue ?pvalue
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?diffValue .
  ?value atlasterms:hasFactorValue ?factor .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref identifiers:ENSMUSG00000034450 .
}
ORDER BY ASC (?pvalue)
Our query
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?geneid ?label ?probe
WHERE {
  ?result mydata:dbXref ?geneid .
  ?geneid rdfs:label ?label .
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?probe atlasterms:dbXref ?geneid
  }
}
1st gotcha
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT ?geneid ?label ?probe ?value
WHERE {
  ?result mydata:dbXref ?geneid .
  ?geneid rdfs:label ?label .
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?value atlasterms:isMeasurementOf ?probe .
    ?probe atlasterms:dbXref ?geneid
  }
}
This should work but there is an issue with querying the EBI RDF Platform with this version of Sesame (fix coming soon!)
1st gotcha
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT ?label ?probe ?value
WHERE {
  ?result mydata:dbXref <http://identifiers.org/ensembl/ENSMUSG00000024673> .
  <http://identifiers.org/ensembl/ENSMUSG00000024673> rdfs:label ?label .
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?value atlasterms:isMeasurementOf ?probe .
    ?probe atlasterms:dbXref <http://identifiers.org/ensembl/ENSMUSG00000024673>
  }
}
Workaround: bind the query to a single gene, <http://identifiers.org/ensembl/ENSMUSG00000024673>, instead of passing the ?geneid variable into the SERVICE block
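If the list of genes comes from a script, the same substitution can be generated per gene. This is a rough sketch under assumptions: `bind_variable` is a hypothetical helper that does a textual replacement on a graph pattern only (not on the SELECT clause, where an IRI would be invalid), and it does not parse SPARQL, so it would also rewrite the variable name inside string literals.

```python
import re


def bind_variable(pattern, variable, iri):
    """Replace every occurrence of ?variable in a graph pattern with <iri>."""
    regex = re.compile(r"\?" + re.escape(variable) + r"\b")
    # A function replacement avoids backslash handling in the IRI text.
    return regex.sub(lambda _match: "<" + iri + ">", pattern)


GENE = "http://identifiers.org/ensembl/ENSMUSG00000024673"

# Remote part of the query from the slide, still using the ?geneid variable.
remote_pattern = (
    "?value atlasterms:isMeasurementOf ?probe .\n"
    "?probe atlasterms:dbXref ?geneid"
)

bound = bind_variable(remote_pattern, "geneid", GENE)
print(bound)
```

The word-boundary check keeps a longer variable name such as `?geneids` untouched.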
Exercise 20
• A) Extend the previous query so that the Atlas endpoint also returns the experiment ID and the factors (property values) where Ms4a1 (ENSMUSG00000024673) is expressed
• B) Filter those results to only include experiments where the factor contains “liver”
Exercise 20 solution A)
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX identifiers: <http://identifiers.org/ensembl/>
SELECT ?label ?expUri ?propertyValue
WHERE {
  ?result mydata:dbXref identifiers:ENSMUSG00000024673 .
  identifiers:ENSMUSG00000024673 rdfs:label ?label .
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?expUri atlasterms:hasAnalysis ?analysis .
    ?analysis atlasterms:hasExpressionValue ?value .
    ?value atlasterms:hasFactorValue ?factor .
    ?factor atlasterms:propertyValue ?propertyValue .
    ?value atlasterms:isMeasurementOf ?probe .
    ?probe atlasterms:dbXref identifiers:ENSMUSG00000024673
  }
}
Exercise 20 solution B)
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX identifiers: <http://identifiers.org/ensembl/>
SELECT ?label ?expUri ?propertyValue
WHERE {
  ?result mydata:dbXref identifiers:ENSMUSG00000024673 .
  identifiers:ENSMUSG00000024673 rdfs:label ?label .
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?expUri atlasterms:hasAnalysis ?analysis .
    ?analysis atlasterms:hasExpressionValue ?value .
    ?value atlasterms:hasFactorValue ?factor .
    ?factor atlasterms:propertyValue ?propertyValue .
    ?value atlasterms:isMeasurementOf ?probe .
    ?probe atlasterms:dbXref identifiers:ENSMUSG00000024673
  }
  FILTER regex(?propertyValue, "liver", "i")
}
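For comparison, the effect of the FILTER in solution B can be reproduced on the client once solution A's rows have been fetched. The sketch below assumes the rows are already available as dictionaries; the sample rows and their example.org URIs are invented for illustration and are not Atlas data. `factor_matches` is a hypothetical helper that mirrors `regex(?propertyValue, "liver", "i")` as an unanchored, case-insensitive match.

```python
import re


def factor_matches(property_value, needle="liver"):
    """Case-insensitive substring match, like regex(?v, "liver", "i")."""
    return re.search(re.escape(needle), property_value, re.IGNORECASE) is not None


# Hypothetical rows in the shape of solution A's result columns.
rows = [
    {"expUri": "http://example.org/exp/1", "propertyValue": "Liver"},
    {"expUri": "http://example.org/exp/2", "propertyValue": "spleen"},
    {"expUri": "http://example.org/exp/3", "propertyValue": "fetal liver cell"},
]

liver_rows = [row for row in rows if factor_matches(row["propertyValue"])]
for row in liver_rows:
    print(row["expUri"], row["propertyValue"])
```

Filtering inside the query, as in solution B, is usually preferable because fewer rows cross the network; the client-side version is mainly useful for checking what the FILTER is expected to keep.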
Alzheimer’s Use Case – EBI RDF platform
• EFO term for Alzheimer’s: EFO_0000249
• Get genes differentially expressed for Alzheimer’s
• Get proteins encoded by those genes
• Get GO annotations from UniProt for those genes
• Get pathways from Reactome in which those proteins are involved
• Get drugs that target proteins within those pathways
Q1. Get Ensembl genes diff expressed for Alzheimer’s
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT distinct ?expressionValue ?dbXref ?pvalue ?propertyValue
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?expressionValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:hasFactorValue ?factor .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref ?dbXref .
  ?dbXref rdf:type atlasterms:EnsemblDatabaseReference .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?factor rdf:type efo:EFO_0000249 .
}
Q2. Get UniProt proteins for those genes
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT distinct ?expressionValue ?dbXref ?pvalue ?propertyValue
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?expressionValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:hasFactorValue ?factor .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref ?dbXref .
  ?dbXref rdf:type atlasterms:UniprotDatabaseReference .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?factor rdf:type efo:EFO_0000249 .
}
Q3. Get UniProt GO Annotations for those genes
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX upc: <http://purl.uniprot.org/core/>
PREFIX identifiers: <http://identifiers.org/ensembl/>
SELECT distinct ?valueLabel ?goid ?golabel
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?expressionValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:hasFactorValue ?factor .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref ?dbXref .
  ?dbXref rdf:type atlasterms:EnsemblDatabaseReference .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?factor rdf:type efo:EFO_0000249 .
  ?value rdfs:label ?valueLabel .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref ?uniprot .
  SERVICE <http://beta.sparql.uniprot.org/sparql> {
    ?uniprot a upc:Protein .
    ?uniprot upc:classifiedWith ?keyword .
    ?keyword rdfs:seeAlso ?goid .
    ?goid rdfs:label ?golabel .
  }
}
Q4. Get pathways from Reactome for those proteins
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>
SELECT DISTINCT ?pathway ?dbXref
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?expressionValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:hasFactorValue ?factor .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref ?dbXref .
  ?dbXref rdf:type atlasterms:UniprotDatabaseReference .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?factor rdf:type efo:EFO_0000249 .
  SERVICE <http://www.ebi.ac.uk/rdf/services/reactome/sparql> {
    ?pathway rdf:type biopax3:Pathway .
    ?pathway biopax3:displayName ?pathwayname .
    ?pathway biopax3:pathwayComponent ?reaction .
    ?reaction rdf:type biopax3:BiochemicalReaction .
    {
      { ?reaction ?rel ?protein . }
      UNION
      { ?reaction ?rel ?complex .
        ?complex rdf:type biopax3:Complex .
        ?complex ?comp ?protein . }
    }
    ?protein rdf:type biopax3:Protein .
    ?protein biopax3:entityReference ?dbXref
  }
}
Q5. Get drugs that target proteins within those pathways
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
SELECT distinct ?dbXrefProt ?pathwayname ?moleculeLabel ?expressionValue ?propertyValue
WHERE {
  # Get differentially expressed genes (and proteins) where the factor is Alzheimer's
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:hasFactorValue ?factor .
  ?value rdfs:label ?expressionValue .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref ?dbXrefProt .
  ?dbXrefProt a atlasterms:UniprotDatabaseReference .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?factor rdf:type efo:EFO_0000249 .
  # Compounds that target them
  SERVICE <http://www.ebi.ac.uk/rdf/services/chembl/sparql> {
    ?act a cco:Activity ;
         cco:hasMolecule ?molecule ;
         cco:hasAssay ?assay .
    ?molecule rdfs:label ?moleculeLabel .
    ?assay cco:hasTarget ?target .
    ?target cco:hasTargetComponent ?targetcmpt .
    ?targetcmpt cco:targetCmptXref ?dbXrefProt .
    ?targetcmpt cco:taxonomy <http://identifiers.org/taxonomy/9606> .
    ?dbXrefProt a cco:UniprotRef .
  }
  SERVICE <http://www.ebi.ac.uk/rdf/services/reactome/sparql> {
    ?protein rdf:type biopax3:Protein .
    ?protein biopax3:memberPhysicalEntity [ biopax3:entityReference ?dbXrefProt ] .
    ?pathway biopax3:displayName ?pathwayname .
    ?pathway biopax3:pathwayComponent ?reaction .
    ?reaction ?rel ?protein
  }
}
Summary
• Why there is a need for new technologies in the life sciences
• Why RDF is a good fit for some of the problems
• The role of ontologies
• Generating RDF triples from data
• Working with an RDF database
• How to write a SPARQL query
• How the EBI is using RDF
Conclusions
• Generating RDF triples is relatively easy
• Extracting the schema from your data can be tricky
• Avoid over-modeling; have good use cases
• Look for appropriate ontologies, reuse terms where possible
• Good tooling now available
• RDF APIs for most programming languages
• Lots of scalable triple stores
• SPARQL is a powerful query language for RDF
• Also very unforgiving; debugging queries is hard
• Treat it as you would SQL: it is not for your average user
Conclusions cont.
• Lots of interest in Linked Data and RDF
• See LOD clouds and DBpedia
• Big name companies using/generating RDF content (Facebook, Google, Oracle)
• Some good examples of applications
• Pharma industry (OpenPhacts project), Semantic publishing (BBC), Government data (data.gov.uk)
• Tread cautiously
• This technology is still maturing
• Not a panacea
• Good solutions for some problems
Thinking beyond RDF and SPARQL
• Selling SPARQL endpoints to biologists is hard, i.e. near impossible
• Entry level is too high and advantages too intangible
• Let programmers code against SPARQL
• Let everyone else use more familiar modes through Apps
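As a small illustration of coding against SPARQL so that others do not have to, the sketch below flattens a result document in the standard SPARQL 1.1 Query Results JSON Format into plain rows that an app could display. `bindings_to_rows` is a hypothetical helper, and the sample document (including its example.org URI and description text) is made up for the example rather than taken from an endpoint.

```python
import json


def bindings_to_rows(results_json):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    data = json.loads(results_json)
    variables = data["head"]["vars"]
    rows = []
    for binding in data["results"]["bindings"]:
        # Unbound variables are simply absent from a binding, so skip them.
        rows.append({v: binding[v]["value"] for v in variables if v in binding})
    return rows


# Hypothetical response shaped like the W3C results format.
sample = json.dumps({
    "head": {"vars": ["experiment", "description"]},
    "results": {"bindings": [
        {
            "experiment": {"type": "uri", "value": "http://example.org/exp/1"},
            "description": {"type": "literal", "value": "example description"},
        },
        {"experiment": {"type": "uri", "value": "http://example.org/exp/2"}},
    ]},
})

rows = bindings_to_rows(sample)
for row in rows:
    print(row)
```

A thin layer like this keeps the query text and result parsing in one place, so the interface shown to end users never needs to mention SPARQL.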
RDFApps
• Our first RDFApp targets the existing community of R users – an ArrayExpress R package already exists
• Goal is to expose the power of the Atlas RDF+SPARQL behind a conventional R interface
• Enables those working with raw data to also use the power of the Atlas
Codefest
• Got an idea for an RDF App? Join us at Codefest 2014
• http://www.open-bio.org/wiki/Codefest_2014
• 18th/19th September, Cambridge, UK
Interesting RDF resources for biology
• EBI RDF (http://www.ebi.ac.uk/rdf )
• Bio2RDF (http://bio2rdf.org )
• BioPortal (https://bioportal.bioontology.org )
• OpenPhacts (https://www.openphacts.org )
• PubChem RDF (https://pubchem.ncbi.nlm.nih.gov/rdf/ )
• Identifiers.org (http://identifiers.org )
• WikiPathways (http://wikipathways.org )
• DisGeNET (http://ibi.imim.es/web/DisGeNET/v01/ )
• W3C Healthcare and Life Sciences Working Group (HCLS - http://www.w3.org/blog/hcls/ )
Acknowledgments
• Samples Phenotypes and Ontologies Group and Functional Genomics Production Team
• James Malone, Robert Petryszak, Tony Burdett, Helen Parkinson
• EBI RDF platform
• Andy Jenkinson, Mark Davies, Marco Brandizi, Sarala Wimalaratne, Leyla Garcia, Jerven Bolleman
Funding
Components of the RDF platform pilot are supported by a number of sources, including:
• EMBL
• European Commission:
• BioMedBridges [284209]
• Diachron [601043]
• OpenPhacts (Innovative Medicines Initiative)
• National Institutes of Health
Questions?
Sign up for our mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/rdf-announce