URI based distributed querying
Peter Ansell
Aim
Access normalised RDF information located in multiple endpoints, using the concepts of Public Namespaces and Private Record Identifiers together with distributed SPARQL queries that are matched to the content in each endpoint
Overall concepts
Query Types: Wrappers around SPARQL queries, selected by matching a regular expression against an input query string.
Normalisation Rules: Rules that define the transformations from a standard normalised URI system to a system matching a particular endpoint, and the reverse if necessary.
Providers: The entities which provide the information. They can be SPARQL endpoints or even simple URLs. If they are proxied they should return RDF information, but redirects are also available for other providers.
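One way to picture how the three concepts relate is as plain records. The sketch below is illustrative only: every field name is an assumption chosen to mirror the descriptions above, not the actual configuration schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryType:
    key: str
    input_regex: str   # matched against the incoming query string
    template: str      # query template containing ${...} variables
    namespace_specific: bool = False


@dataclass
class NormalisationRule:
    key: str
    match_regex: str   # pattern in the standard (or endpoint) URI form
    replacement: str   # what the matched pattern is rewritten to


@dataclass
class Provider:
    key: str
    endpoint_url: str
    query_types: List[str] = field(default_factory=list)          # QueryType keys
    namespaces: List[str] = field(default_factory=list)
    normalisation_rules: List[str] = field(default_factory=list)  # rule keys
    redirect: bool = False   # redirect instead of proxying
```

A provider refers to query types and normalisation rules by key, so the same rule can be shared by several providers.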
URI resolution example
User enters an HTTP URL into their user agent: http://mybio2rdf.local/namespace:identifier
Servlet receives the request:
Hostname: mybio2rdf.local
Query string: /namespace:identifier
Servlet performs URL rewriting to pass query string to the atlas2rdf.jsp page based on WEB-INF/urlrewrite.xml
URI resolution example
The query string is matched against the regular expressions in the configured query types and the unique query titles which had successful matches are selected
/namespace:identifier matches at least http://qut.bio2rdf.org/query:construct and http://qut.bio2rdf.org/query:taglabels
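The matching step can be sketched as a loop over the configured regular expressions. The patterns below are assumptions made for illustration (the real expressions live in the configuration), and the third query type is a hypothetical non-matching entry.

```python
import re

# Hypothetical query types: key -> regular expression (patterns are assumed).
QUERY_TYPES = {
    "http://qut.bio2rdf.org/query:construct": r"^/?([\w-]+):(.+)$",
    "http://qut.bio2rdf.org/query:taglabels": r"^/?([\w-]+):(.+)$",
    "http://example.org/query:search": r"^/?search/(.+)$",
}


def match_query_types(query_string, query_types=QUERY_TYPES):
    """Return {query type key: matching groups} for every pattern that matches."""
    matches = {}
    for key, pattern in query_types.items():
        m = re.match(pattern, query_string)
        if m:
            # The groups later become ${input_1}, ${input_2}, ...
            matches[key] = list(m.groups())
    return matches


matched = match_query_types("/namespace:identifier")
```

Here `matched` contains the construct and taglabels keys, each with the groups `["namespace", "identifier"]`, while the hypothetical search query type is left out.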
URI resolution step
For each of the query types a namespace test is applied to determine which regular expression matching groups are relevant, and whether the query type matches the given namespace
URI resolution step
Namespace test:
Is the query type specific to namespaces? If false, include the query type.
See CUSTOM_QUERY_NAMESPACE_PROVIDER_SPECIFIC
If so, is the query type relevant to all namespaces? If true, include the query type.
See CUSTOM_QUERY_HANDLE_ALL_NAMESPACES
If not, check whether the query string matching groups, at the matching group numbers declared for the query type, matched either any or all of the query type's namespaces, as configured.
See CUSTOM_QUERY_NAMESPACES_TO_HANDLE, CUSTOM_QUERY_NAMESPACE_INPUT_INDEXES, and CUSTOM_QUERY_NAMESPACE_MATCH_METHOD
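The three-stage namespace test above might look like the following sketch. The dictionary keys are assumed stand-ins for the CUSTOM_QUERY_* settings, and the 1-based group indexes (so that index 1 corresponds to `${input_1}`) are also an assumption.

```python
def namespace_test(query_type, matched_groups):
    """Decide whether a query type applies to the namespaces in the query string."""
    # Stage 1: query types that are not namespace specific are always included.
    if not query_type["namespace_specific"]:
        return True
    # Stage 2: namespace specific, but relevant to all namespaces.
    if query_type["handle_all_namespaces"]:
        return True
    # Stage 3: compare the declared matching groups with the handled namespaces.
    candidates = [matched_groups[i - 1] for i in query_type["namespace_input_indexes"]]
    hits = [ns in query_type["namespaces_to_handle"] for ns in candidates]
    if query_type["namespace_match_method"] == "all":
        return all(hits)
    return any(hits)


# Hypothetical query type that only handles two namespaces.
example_query_type = {
    "namespace_specific": True,
    "handle_all_namespaces": False,
    "namespaces_to_handle": ["geneid", "uniprot"],
    "namespace_input_indexes": [1],
    "namespace_match_method": "any",
}
```

With this configuration, `namespace_test(example_query_type, ["geneid", "1234"])` includes the query type, while a query string in any other namespace excludes it.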
URI resolution example
Both query:construct and query:taglabels are relevant to all namespaces, and contain the namespace as the first matching group index, and since they have only one matching group as a namespace the match method is not relevant
URI resolution step
For each of the chosen query types, get a list of providers which handle the query title
If a query type is namespace specific, filter its list of providers based on whether they match any or all of the namespaces according to the query title namespace matching configuration. This time the inclusion is based on the namespace test with the list of namespaces configured for the provider
URI resolution example
The query titles construct and taglabels were chosen, so they are now matched against the total list of providers to gain an initial list
The construct query is namespace specific, so only construct providers which handle the given namespace will be included, whereas the taglabels query is not namespace specific, so any taglabels providers will be included in the final provider list
URI resolution step
Any of the providers which were defined as default and which handle the given query type are also included at this stage, without regard to the namespaces.
Default providers are intended to make it simpler to configure intermediate servers without having to know about all of the known namespaces
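Provider selection, including default providers, can be sketched as a single filter. The provider records and their keys are hypothetical, and for brevity the sketch only implements the "any" namespace match method.

```python
def select_providers(query_type_key, namespace_specific, namespaces, providers):
    """Return the keys of the providers that should receive this query type."""
    selected = []
    for provider in providers:
        if query_type_key not in provider["query_types"]:
            continue  # the provider does not handle this query type at all
        if provider.get("is_default", False):
            selected.append(provider["key"])  # defaults ignore namespaces
        elif not namespace_specific:
            selected.append(provider["key"])  # no namespace filtering needed
        elif any(ns in provider["namespaces"] for ns in namespaces):
            selected.append(provider["key"])  # "any" match method (assumed)
    return selected


# Hypothetical provider list used only for illustration.
EXAMPLE_PROVIDERS = [
    {"key": "construct_ns", "query_types": ["construct"], "namespaces": ["namespace"]},
    {"key": "construct_other", "query_types": ["construct"], "namespaces": ["other"]},
    {"key": "tags", "query_types": ["taglabels"], "namespaces": []},
    {"key": "mirror", "query_types": ["construct"], "namespaces": [], "is_default": True},
]

construct_providers = select_providers("construct", True, ["namespace"], EXAMPLE_PROVIDERS)
taglabels_providers = select_providers("taglabels", False, ["namespace"], EXAMPLE_PROVIDERS)
```

The default provider `mirror` is selected for construct even though it lists no namespaces, which is what lets an intermediate server be configured without knowing every namespace.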
URI resolution step
For each of the query types, and for each of the providers which remain:
If a provider needs a redirect, as opposed to proxying communication, replace any template variables on the endpoint URL and send an HTTP 302 redirect response as the result
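A minimal sketch of the redirect check, assuming a provider record with hypothetical `redirect` and `endpoint_url` fields. Python's `string.Template` happens to use the same `${name}` syntax as the template variables, so it stands in here for the real replacement code.

```python
from string import Template


def resolve_redirect(provider, variables):
    """Return (status, headers) for a redirecting provider, or None when proxying."""
    if not provider["redirect"]:
        return None
    # Replace template variables in the endpoint URL; unknown ones are left as-is.
    location = Template(provider["endpoint_url"]).safe_substitute(variables)
    return 302, {"Location": location}


# Hypothetical redirecting provider and matching groups.
html_provider = {
    "redirect": True,
    "endpoint_url": "http://example.org/view/${input_1}/${input_2}",
}
response = resolve_redirect(html_provider, {"input_1": "namespace", "input_2": "identifier"})
```

Here `response` carries status 302 and a `Location` header pointing at `http://example.org/view/namespace/identifier`.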
URI resolution step
If there are no redirects, generate the actual queries based on the templates given in the query types and the normalisation rules for the provider
The normalisation rules are matched against the template variables and replaced as necessary in order to make them specific to the relevant endpoint
Query templates
Some of the template variables include:
${graphStart} and ${graphEnd} to allow for SPARQL graphs, or the lack of a graph
${endpointSpecificUri} to allow for the SPARQL endpoint to contain a different URI to the one which is desired
${input_1}, ${input_2}, etc., which correspond to the matching groups from the query type. ${input_1} is typically the namespace, although this is configurable.
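To illustrate how `${graphStart}` and `${graphEnd}` let one template serve endpoints with and without a named graph, here is a sketch. The CONSTRUCT template and the expansion of the graph variables into a `GRAPH <...> { }` wrapper are assumptions, not the configured query.

```python
from string import Template

# Assumed template; the real construct query is defined in the configuration.
CONSTRUCT_TEMPLATE = (
    "CONSTRUCT { <${endpointSpecificUri}> ?p ?o } "
    "WHERE { ${graphStart} <${endpointSpecificUri}> ?p ?o . ${graphEnd} }"
)


def build_query(template, endpoint_specific_uri, graph_uri=""):
    """Fill in the template, wrapping the pattern in GRAPH only if a graph is set."""
    graph_start = "GRAPH <%s> {" % graph_uri if graph_uri else ""
    graph_end = "}" if graph_uri else ""
    return Template(template).safe_substitute(
        endpointSpecificUri=endpoint_specific_uri,
        graphStart=graph_start,
        graphEnd=graph_end,
    )


with_graph = build_query(CONSTRUCT_TEMPLATE, "http://example.org/ns:id", "http://example.org/graph")
without_graph = build_query(CONSTRUCT_TEMPLATE, "http://example.org/ns:id")
```

When no graph URI is given, both graph variables collapse to empty strings and the query runs against the default graph.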
Query templates
Some more template variables include:
${graphUri} the graph URI, which is empty if no graph exists
${endpointUrl} this can also have template variables inside it, which are replaced before the redirect check phase
${defaultHostAddress} the standard base URL for this configuration, i.e. http://bio2rdf.org/
${realHostName} the actual host being used, e.g. http://mymirror.local/bio2rdf/
Query templates
Some template variables are available in their encoded forms. For example:
${urlEncoded_endpointSpecificUri} a fully percent encoded version of the URI
${inputUrlEncoded_normalisedStandardUri} a version of the standard URI as given by the query type with the ${input_NN} sections internally percent encoded
${xmlEncoded_inputUrlEncoded_normalisedStandardUri} for use in RDF/XML templates
${inputUrlEncoded_privatelowercase_endpointSpecificUri} for use with endpoints which contain percent encoded URIs that have the private ${input_NN} variables completely in lowercase without regard to the case given in the ${queryString}
${queryString} The original input string which matched against the query type regular expression
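The difference between the encoded forms is easier to see side by side. This sketch builds them for a standard URI of the shape base + namespace:identifier. It assumes that `${input_2}` is the private identifier, and it uses the standard URI in place of an endpoint-specific one so that no normalisation rules are needed.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape


def encoded_variables(base, namespace, identifier):
    """Build several encoded variants of base + namespace:identifier."""
    plain = base + namespace + ":" + identifier
    # Only the ${input_NN} parts are percent encoded; the URI structure is kept.
    input_encoded = base + quote(namespace, safe="") + ":" + quote(identifier, safe="")
    # Same, but with the private identifier forced to lowercase first.
    private_lower = base + quote(namespace, safe="") + ":" + quote(identifier.lower(), safe="")
    return {
        "normalisedStandardUri": plain,
        "urlEncoded_normalisedStandardUri": quote(plain, safe=""),
        "inputUrlEncoded_normalisedStandardUri": input_encoded,
        "xmlEncoded_inputUrlEncoded_normalisedStandardUri": escape(input_encoded),
        "inputUrlEncoded_privatelowercase_normalisedStandardUri": private_lower,
    }


variables = encoded_variables("http://bio2rdf.org/", "geneid", "A&B c")
```

The fully encoded form escapes the scheme and slashes as well, so it suits use as a parameter value inside another URL, whereas the input-encoded form stays a usable URI.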
Query templates example
For http://bio2rdf.org/namespace1:identifier1, ${queryString} = namespace1:identifier1
The other variables will be different depending on whether the construct provider for namespace1 is being contacted, or
URI resolution step
For each query, check its communication method
If it is declared as nocommunication, ignore it for now. It will be used with the static RDF/XML insertion stage
If it is declared as httpgeturl then perform HTTP resolution on the provider endpoint URL after replacing the relevant template variables
URI resolution step
If the communication method is declared as httppostsparql then POST the replaced query template to the endpoint URL
The SPARQL query is matched to the endpoint at this stage by the use of a query type that contains the basic structure of the query, and normalisation rules to make sure the URIs in the SPARQL match the endpoint and graph combination
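The three communication methods can be sketched as a dispatch that plans an HTTP request without sending it. Posting the query as a form-encoded `query` parameter follows the SPARQL protocol convention and is an assumption about what this system does. The returned dictionary shape is hypothetical.

```python
from urllib.parse import urlencode


def plan_request(method, endpoint_url, query=""):
    """Describe the HTTP request for a communication method, or None if there is none."""
    if method == "nocommunication":
        return None  # handled later, at the static RDF/XML insertion stage
    if method == "httpgeturl":
        # The endpoint URL already has its template variables replaced.
        return {"http_method": "GET", "url": endpoint_url, "body": None}
    if method == "httppostsparql":
        return {
            "http_method": "POST",
            "url": endpoint_url,
            "body": urlencode({"query": query}),
        }
    raise ValueError("unknown communication method: %s" % method)


planned = plan_request(
    "httppostsparql",
    "http://example.org/sparql",
    "SELECT * WHERE { ?s ?p ?o }",
)
```

Separating planning from sending keeps the dispatch testable without a live endpoint.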
URI resolution step
The results of the httpgeturl and httppostsparql HTTP requests are passed through the list of RDF normalisation rules which are configured for the provider that was chosen, so that they are normalised to the desired output format
More than one provider may be attached to the same endpoint and graph combination, so a given URI may resolve using more than one query on the same endpoint and graph depending on the query needs
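Output normalisation can be pictured as a sequence of regular expression replacements applied to the returned RDF text. The rule below, which maps DBpedia resource URIs to a bio2rdf.org form, is a hypothetical example and not a rule taken from the configuration.

```python
import re

# Hypothetical output rule: (pattern in endpoint results, normalised replacement).
EXAMPLE_RULES = [
    (r"http://dbpedia\.org/resource/", "http://bio2rdf.org/dbpedia:"),
]


def normalise_results(rdf_text, rules):
    """Apply each provider rule in order to the raw results from an endpoint."""
    for pattern, replacement in rules:
        rdf_text = re.sub(pattern, replacement, rdf_text)
    return rdf_text


raw = '<rdf:Description rdf:about="http://dbpedia.org/resource/Aspirin"/>'
normalised = normalise_results(raw, EXAMPLE_RULES)
```

Because the rules belong to the provider rather than to the endpoint, two providers attached to the same endpoint and graph can still normalise their results differently.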
Accessible databases
Each of the following databases has normalisation rules which normalise it back to bio2rdf.org URIs:
Dbpedia, Drugbank, LinkedCT, HCLS KB/Neurocommons, Diseasome, Dailymed, Bioguid DOI
These, together with the 40+ Bio2RDF SPARQL endpoints, form a very large accessible knowledge base!
RDF accessible configuration
The configuration, including all query types, RDF normalisation rules, providers and known namespaces is available in RDF
http://qut.bio2rdf.org/admin/configuration/rdfxml
Integrating user extensions
A clear use case for a system where arbitrary queries can be performed as part of a single URI resolution is to integrate novel datasources such as user tags
The only requirement is that the query type relevant to the tags, etc., matches the regular expression for the URI it is extending. For example, http://qut.bio2rdf.org/query:taglabels and http://qut.bio2rdf.org/query:construct both have regular expressions that match the basic http://bio2rdf.org/namespace:identifier URI
Future work
Content negotiation between RDF formats
HTML formatted results for easy browsing, possibly using Pubby as the rendering engine
Paged SPARQL calls using OFFSET and LIMIT
Alternative configurations for Dbpedia, SharedNames etc. that don't require http://bio2rdf.org/ as the base URI and have different basic queries
Import configuration from RDF similar to the current configuration output
Future work
Provide more pipes to perform integrated actions without having to put HTTP SPARQL requests into a workflow system when a URI resolution can perform the query in a distributed and normalised manner more efficiently
Bring together the current distributed efforts to provide a complete HTML redirection registry so that a large percentage of Bio2RDF namespaces can be redirected with http://bio2rdf.org/html/namespace:identifier
Form ontologies describing the query type, provider, RDF normalisation rule and namespace paradigm
Future work
Integrate http://rdf.myexperiment.org/sparql and similar workflow RDF endpoints so that scientific workflows can be linked to their data cleanly, and user enhancements such as tags and publications are cleanly integrated with the actual datasources they were derived from