ontop: A tutorial

+

ontop: A Tutorial

Mariano Rodriguez Muro, Ph.D.

Free University of Bozen-Bolzano

Bolzano, Italy

http://www.rodriguez-muro.com

Protégé-OWL Short Course

September 2-4, 2013

Vienna, Austria

+Disclaimer

License

This work is licensed under the Creative Commons Attribution-Share Alike 3.0 License http://creativecommons.org/licenses/by-sa/3.0/

The material for this presentation is available at: https://www.dropbox.com/sh/q3aowgiq5dnco7n/as0QniGPKy


+Who am I?

Researcher at:

Free University of Bozen-Bolzano, Bolzano, Italy

From October at:

IBM Watson Research Center

Research topics: OBDA, Efficient reasoning in OWL, Query rewriting, Data integration

Leader of the ontop project

+Why are we here?

Data and ontologies

To get the basics of ontology based data access

To learn how to do it with ontop

To grasp some of the possible uses of the technology and point to the available resources

+Tutorial Overview

Part 1: Introduction. Quick introduction to Ontology Based Data Access

Part 2: The basics. Creating an SQL database, creating simple (direct) mappings with ontopPro and querying

Part 3: Modeling in OBDA. Creating mappings that reflect the domain

Part 4: Data Integration. Using ontop to query data from multiple sources

Part 5: Ontologies and ontop. Extending and using with domain knowledge (OWL)

+Material

The tutorial is organized as a hands-on session. Try to perform the described tasks.

Most commands, mappings and queries are in files in the “materials” folder. Still, try to write them on your own.

Material included (ontop-tutorial-viena13.zip):
• README.txt, an index for the ZIP file
• H2 Database (h2.zip)
• .obda/.owl files (resulting mappings and ontologies for all examples)
• .sql files (SQL commands that create the tutorial DBs)

+Part 1 Introduction

+SQL DBs

Standard way to store LARGE volumes of data

Mature, Robust and FAST

Domain is structured as tables, data becomes rows in these tables.

Powerful query language (SQL) to retrieve this data.

Major companies developed SQL DBs for the last 30 years (IBM, Microsoft, Oracle) and even open source projects are now quite robust (MySQL, PostgreSQL).

+OBDA and motivation

Ontology Based Data Access (OBDA) is a research area that focuses on accessing data through ontologies. ontop's focus is on SQL DBs (RDBMSs).

Benefits:
• Flexible data model (OWL/RDF)
• Flexible query language (OWL or SPARQL)
• Inference
• Speed, volume and features (by reusing SQL DBs)

Possible applications:
• Semantic Query Answering
• Data integration
• Semantic Search

+Two approaches for OBDA: Extract Transform Load (ETL)

[Figure: ETL architecture. Labels: Source, Inputs (Data, Code), TBox, Reasoner, Application]

Data is transformed into OWL ABox assertions that are combined with OWL axioms and then given to a reasoner or query engine.

Limitations: performance and memory.

+Two approaches for OBDA: On-the-fly

[Figure: on-the-fly architecture. Labels: Source, Input (Ontology, Mappings), Reasoner, Application]

Mappings are axioms that relate the data in an RDBMS to the vocabulary of the ontology (the classes and the properties); in a sense, they “connect” the two vocabularies.

The inputs are the ontology and the mappings. The reasoner answers queries by transforming them into queries over the source. Because the reasoner is connected to the source, data is not duplicated and is always up-to-date.

+

ontop is a platform to query RDBMSs through OWL/RDFS ontologies on-the-fly, using SPARQL. It's extremely fast and is packed with features.

It has 2 main components:
• Quest. A reasoner/query engine that is able to answer SPARQL 1.0 queries, and supports OWL 2 QL inference and a powerful mapping language. It can be run in Java applications or as a stand-alone SPARQL server.
• ontopPro. A plugin for Protégé 4 that provides a mapping editor and allows you to use Quest directly from Protégé.

Today we will focus on learning to use OBDA with ontopPro.

+Part 2 ontopPro: The basics (SQL, Mappings, Queries)

+Overview

Flash recap of SQL DBs with H2

Using ontopPro:
• Connecting a DB with Protégé
• Creating mappings
• Querying

About mappings in ontop

About query answering in ontop

+An SQL database, H2

A pure Java SQL database

Easy to install: just unzip the downloaded package (h2-simple.zip, already on your USB stick)

Easy to run, just run the scripts:
• Open a terminal (on Mac, Terminal.app; on Windows, run cmd.exe)
• Move to the H2 folder (e.g., cd h2)
• Start H2 using the h2 scripts: sh h2.sh (on Mac/Linux) or h2w.bat (on Windows)

[Screenshot: starting H2 from the terminal]

A yellow icon indicates the DB is running (right click for a context menu)

Connection URL: jdbc:h2:tcp://localhost/test
• jdbc:h2:tcp: = protocol information
• localhost = server location
• test = database name
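The three parts of the connection URL can be pulled apart with a few lines of Python. This is only an illustrative sketch: `parse_h2_tcp_url` is a hypothetical helper (not part of H2 or ontop), and it handles only the simple `jdbc:h2:tcp://server/database` shape used in this tutorial.

```python
def parse_h2_tcp_url(url: str) -> dict:
    """Split an H2 TCP JDBC URL into protocol, server location and database name."""
    prefix = "jdbc:h2:tcp://"
    if not url.startswith(prefix):
        raise ValueError(f"not an H2 TCP JDBC URL: {url!r}")
    # Everything before the first '/' is the server, the rest is the database name.
    server, _, database = url[len(prefix):].partition("/")
    return {"protocol": "jdbc:h2:tcp:", "server": server, "database": database}


parts = parse_h2_tcp_url("jdbc:h2:tcp://localhost/test")
print(parts)
```

The same three values are what you will later type into the data source form in Protégé.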

+Creating the database

We’ll create a table to store lung cancer information as follows:

patientid | name | type  | stage
1         | Mary | false | 2
2         | John | true  | 7

type is:
• false for Non-Small Cell Lung Cancer (NSCLC)
• true for Small Cell Lung Cancer (SCLC)

stage is:
• 1-6 for stage I, II, III, IIIa, IIIb, IV NSCLC
• 7-8 for Limited, Extensive SCLC

+Creating the table

To create the table (file patient-table1.sql):

CREATE TABLE tbl_patient (
    patientid INT NOT NULL PRIMARY KEY,
    name VARCHAR(40),
    type BOOLEAN,
    stage TINYINT
);

+Inserting the data

To insert the data:

INSERT INTO tbl_patient (patientid,name,type,stage) VALUES (1,'Mary',false,2),(2,'John',true,7);

+Retrieving data: SQL

To retrieve all data:

SELECT * FROM TBL_PATIENT;

+Other relevant queries

To get all IDs of patients with NSCLC:

SELECT patientid FROM TBL_PATIENT WHERE TYPE = false;

To get the IDs of patients with NSCLC at stage II or above (stage code 2 or higher):

SELECT patientid FROM TBL_PATIENT WHERE TYPE = false AND stage >= 2;
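To try these statements without a running H2 server, the sketch below replays them in Python with the standard-library sqlite3 module. SQLite is an assumption standing in for H2 here; it stores booleans as 0/1, so `TYPE = false` is written as `type = 0`.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the H2 "test" database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE tbl_patient (
           patientid INT NOT NULL PRIMARY KEY,
           name VARCHAR(40),
           type BOOLEAN,
           stage TINYINT
       )"""
)
# Python booleans are stored as the integers 0 (false) and 1 (true).
conn.executemany(
    "INSERT INTO tbl_patient (patientid, name, type, stage) VALUES (?, ?, ?, ?)",
    [(1, "Mary", False, 2), (2, "John", True, 7)],
)

all_rows = conn.execute("SELECT * FROM tbl_patient ORDER BY patientid").fetchall()
nsclc_ids = [r[0] for r in conn.execute(
    "SELECT patientid FROM tbl_patient WHERE type = 0")]
nsclc_stage2_up = [r[0] for r in conn.execute(
    "SELECT patientid FROM tbl_patient WHERE type = 0 AND stage >= 2")]

print(all_rows)
print(nsclc_ids, nsclc_stage2_up)
```

With the two tutorial rows, only Mary (patientid 1) satisfies either filter.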

+First ontop mapping

Objective: each row generates the following OWL data.

An OWL individual of the form:
:db1/1

OWL assertions of the form:
ClassAssertion( :Person :db1/1 )
DataPropertyAssertion( :id :db1/1 "1" )
DataPropertyAssertion( :name :db1/1 "Mary" )
DataPropertyAssertion( :type :db1/1 "false" )
DataPropertyAssertion( :stage :db1/1 "2" )

That is, we define a vocabulary of Classes and Properties that we want to “populate” using the data from the database.

A direct mapping

+A direct mapping (cont.)

Seen graphically:

Things to note:
• The OWL object is identified by an IRI
• Values have OWL data types

patientid | name | type  | stage
1         | Mary | false | 2

+Step 0: Starting Protégé+ontop

Unzip the protégé-ontop bundle from your material. This is a Protégé 4.3 package that includes the ontop plugin.

Run Protégé using the run.bat or run.sh scripts. That is, execute:

cd Protege_4.3_ontopPro/
sh run.sh

+Step 1: Defining the base URI

Define the ontology base URI: http://example.org/

Save the ontology

Close and re-open Protégé (sorry, this is due to a bug)

Enable the OBDA Model tab in Window -> Tabs

+Step 2: Add the datasource

Using the OBDA model tab, we now need to define the connection parameters to our lung cancer database

Steps:
0. Switch to the OBDA model tab
1. Add a new data source (give it a name, e.g., LungCancerDB)
2. Define the connection parameters as follows:
   Connection URL: jdbc:h2:tcp://localhost/test
   Username: sa
   Password: (leave empty)
   Driver class: org.h2.Driver (choose it from the drop down menu)
3. Test the connection using the “Test Connection” button

+

A “Connection is OK” message means Protégé and ontop were able to connect to our H2 server and see the “test” DB we just created. We are now ready to add the mappings for the DB.

+Step 3: Create a mapping

Add the class: http://example.org/Patient

Switch to the “Mapping Manager” tab in the OBDA Model tab.

1. Select the LungCancerDB source

2. Add a mapping with ID “patient-map”

target: :db1/{PATIENTID} a :Patient .
source: SELECT * FROM TBL_PATIENT

NOTE: use upper case

+Adding a Mapping

Select the LungCancerDB from the drop down menu.

Click the “Create” button

+

The “assertion template”, a.k.a. “triple template”, tells ontop how to create URIs and class and property assertions using the data from the DB (from the SQL query).

+The meaning of mappings

Mappings + DB data “entail” (have as a consequence) OWL data, i.e., OWL ABox assertions.

This “entailed” data is accessible at query time (the on-the-fly approach) or can be imported into the OWL ontology (the ETL approach).

+The meaning of mappings

ontop’s main way to access data is on-the-fly. However, you can also do ETL using the “import data from mappings” function in the “ontop” menu.

Do it now and explore the result in the “Individuals” tab. When done, remember to delete these individuals.

Use with care, you may run out of memory.

+On-the-fly access to the DB

This is the main way to access data in ontop, and it is done by querying ontop with SPARQL.

The “query engine”/“reasoner” that comes with ontop is called “Quest”.

Enable Quest in the “Reasoner” menu

+On-the-fly access to the DB

Next, enable the “OBDA query” tab (ontop SPARQL) in the tabs menu

+Querying with Quest

In the OBDA Tab:

1. Write the SPARQL query

2. Click execute

3. Inspect the results

TEMPLATE: :db1/{PATIENTID} a :Patient .

The results are no longer the numeric IDs in the database. They are URIs constructed as specified in the mapping, by replacing the “column references” with the actual values obtained from the database (the values in each row).
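The substitution of column references can be pictured with a small Python sketch. It only illustrates the idea and is not ontop's implementation: `expand_template` is a hypothetical helper, and the `:` prefix is assumed to resolve to the base URI http://example.org/ defined in Step 1.

```python
import re

BASE = "http://example.org/"  # ontology base URI from Step 1


def expand_template(template: str, row: dict) -> str:
    """Replace each {COLUMN} reference with the row's value, then resolve the ':' prefix."""
    filled = re.sub(r"\{(\w+)\}", lambda m: str(row[m.group(1)]), template)
    return BASE + filled[1:] if filled.startswith(":") else filled


# Two result rows of SELECT * FROM TBL_PATIENT, reduced to the column the template uses.
rows = [{"PATIENTID": 1}, {"PATIENTID": 2}]
uris = [expand_template(":db1/{PATIENTID}", row) for row in rows]
print(uris)
```

Each row yields one URI, which is why the query answers are individuals such as http://example.org/db1/1 rather than the bare value 1.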

+The rest of the patient mappings

Add the following Data Properties: :id, :name, :type, :stage

To complete the model we can add the following mappings one by one and “synchronize” the reasoner:

target: :db1/{PATIENTID} :id {PATIENTID} .
source: SELECT * FROM TBL_PATIENT

target: :db1/{PATIENTID} :name {NAME} .
source: SELECT * FROM TBL_PATIENT

target: :db1/{PATIENTID} :type {TYPE} .
source: SELECT * FROM TBL_PATIENT

target: :db1/{PATIENTID} :stage {STAGE} .
source: SELECT * FROM TBL_PATIENT

OR…

+The rest of the patient mappings

Add the following Data Properties: :id, :name, :type, :stage

Or, you can modify the original mapping as follows so that it generates multiple assertions at the same time:

target: :db1/{PATIENTID} a :Patient ; :id {PATIENTID} ; :name {NAME} ; :type {TYPE} ; :stage {STAGE} .
source: SELECT * FROM TBL_PATIENT

Don’t forget to synchronize with the reasoner…

+About Mappings

A mapping represents OWL assertions: one set of OWL assertions for each result row returned by the SQL query in the mapping. The assertions that the mapping represents are those obtained by replacing the placeholders with the values from the DB.

Mappings are composed of:
• Mapping ID: an arbitrary name for each mapping (choose something that allows you to identify the mapping)
• Source: an SQL query that retrieves some of the data from the database
• Target: a form of “template” that indicates how to generate OWL assertions (class or property) in a syntax very close to the “Turtle” syntax for RDF
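The three parts of a mapping, and the "one set of assertions per result row" reading, can be sketched in Python as follows. The `Mapping` class and `apply_mapping` function are hypothetical names for illustration, SQLite stands in for H2, and values are substituted as plain text (real ontop also assigns datatypes).

```python
import re
import sqlite3
from dataclasses import dataclass


@dataclass
class Mapping:
    mapping_id: str  # arbitrary name
    source: str      # SQL query over the database
    target: str      # triple template with {COLUMN} placeholders


def apply_mapping(conn: sqlite3.Connection, mapping: Mapping) -> list:
    """Return one instantiated target per row returned by the source query."""
    cur = conn.execute(mapping.source)
    columns = [d[0].upper() for d in cur.description]  # templates use upper-case names
    assertions = []
    for values in cur:
        row = dict(zip(columns, values))
        assertions.append(
            re.sub(r"\{(\w+)\}", lambda m: str(row[m.group(1)]), mapping.target))
    return assertions


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_patient (patientid INT PRIMARY KEY, "
             "name VARCHAR(40), type BOOLEAN, stage TINYINT)")
conn.executemany("INSERT INTO tbl_patient VALUES (?, ?, ?, ?)",
                 [(1, "Mary", 0, 2), (2, "John", 1, 7)])

name_map = Mapping("patient-name",
                   "SELECT * FROM tbl_patient ORDER BY patientid",
                   ":db1/{PATIENTID} :name {NAME} .")
assertions = apply_mapping(conn, name_map)
print(assertions)
```

Running the mapping eagerly like this corresponds to the ETL reading; the on-the-fly approach keeps the mapping symbolic and only uses it to translate queries.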

+Assertion Template Examples

Assertion templates are formed as a triple

“subject predicate object”

The subject is always a URI; the object may be another URI or an OWL value.

Class assertions use rdf:type or a as predicate, and a URI as object (the class name), e.g.:

:db1/{id} rdf:type :Person
<http://live.dbpedia.org/page/{name}> a :Writer

Object/data property assertions have any URI as predicate (the property URI) and a URI or OWL value as object:

:db1/{id} :name {NAME}
:db1/{id} :age {C1}^^xsd:string
:db1/{id} :knows :db1/{id2}
:db1/{id} :knows :Michael_Jackson

+Practical Notes About Mappings

With ontopPro, mapping and data source definitions are stored in .obda files

.obda files are located in the same folder as the .owl ontology

They should have the same name as the .owl file

.obda files are text files, so they may be created and edited manually. This can be more convenient in several cases, e.g., automatically generating large numbers of mappings, quick refactoring using regular expressions, etc.

+About Query Answering in Ontop

ontop’s query engine uses “query rewriting” techniques

Given a SPARQL query, ontop translates it into an SQL query using the mappings (and the ontology). You can get the SQL query generated by ontop using the context menu in the OBDA query tab.
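To give a feel for what such a translation does, here is a deliberately tiny Python sketch that unfolds the query `SELECT ?x WHERE { ?x a :Patient }` into SQL using the “patient-map” mapping. It is an assumption-laden illustration, not the SQL that ontop actually generates: `rewrite_class_query` and `CLASS_MAPPINGS` are hypothetical names, SQLite stands in for H2, and only single-class queries are handled.

```python
import re
import sqlite3

BASE = "http://example.org/"

# Hypothetical in-memory form of "patient-map": class -> (source SQL, subject template).
CLASS_MAPPINGS = {":Patient": ("SELECT * FROM tbl_patient", ":db1/{PATIENTID}")}


def rewrite_class_query(class_name: str) -> str:
    """Unfold 'SELECT ?x WHERE { ?x a <class_name> }' into SQL over the source."""
    source, subject = CLASS_MAPPINGS[class_name]
    # re.split with a capture group alternates constant text (even index)
    # and column names (odd index).
    pieces = re.split(r"\{(\w+)\}", BASE + subject[1:])
    terms = [p if i % 2 else f"'{p}'" for i, p in enumerate(pieces) if p]
    return f"SELECT {' || '.join(terms)} AS x FROM ({source}) AS src ORDER BY x"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_patient (patientid INT PRIMARY KEY, "
             "name VARCHAR(40), type BOOLEAN, stage TINYINT)")
conn.executemany("INSERT INTO tbl_patient VALUES (?, ?, ?, ?)",
                 [(1, "Mary", 0, 2), (2, "John", 1, 7)])

sql = rewrite_class_query(":Patient")
answers = [r[0] for r in conn.execute(sql)]
print(sql)
print(answers)
```

The URIs are built inside the SQL query itself, so no data has to be copied out of the database before answering.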

+About Query Answering in Ontop

Key features:
• Volume: by relying on SQL DBs, the datasets that ontop can handle are in the GBs and TBs
• Fast: ontop generates efficient SQL queries that, when combined with a fast SQL engine, provide answers in milliseconds. Not all SQL queries are fast; most of the research and development efforts in ontop go towards generating FAST SQL queries

Possible drawbacks:
• Maturity: SPARQL support in ontop is under development and many features are still missing
• Know-how: SQL expertise will be required to obtain the best performance with large datasets

+Part 3 Domain Modeling in OBDA

+Application independent mappings

The OWL vocabulary so far is a one-to-one reflection of the database, which is not very interesting or useful.

We would like:
• Application independent vocabulary
• Vocabulary beyond the one explicit in the DB
• Individuals and relations between them that reflect our understanding of the domain

For example…

+Redesigning our model

Highlights:
• The vocabulary is more domain oriented
• No more values to encode types or stages. There is a new individual :db1/neoplasm/1 that stands for the cancer (tumor) of Mary and it is an instance of the class :NSCLC. There are URIs (individuals) that represent the stage of the cancer

This model is closer to the formal model of the domain, independent from the DB. Later, this will allow us to easily integrate new data or domain information (e.g., an ontology).

+Constructing the new model 1

Remove the old mappings and vocabulary, then:

Create the new vocabulary:
Object Properties: :hasStage, :hasNeoplasm
Classes: :SCLC, :NSCLC

Add mappings for the new classes and properties as follows:

+Basic mappings

Basic mapping to generate the patient individual as well as the new “neoplasm” for that individual.

+Classifying the neoplasm

Now we classify the neoplasm individual using our knowledge of the database.

We know that “false” in the table patient indicates a “Non Small Cell Lung Cancer”, so we classify the neoplasm as a :NSCLC. Similarly for :SCLC.

+Associating a stage

We associate the neoplasms of each patient to a stage. Note that the stage is no longer an arbitrary value, but a constant URI with clear meaning, independent from the DB.
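The mappings for these three slides are not reproduced in this text, so the following Python sketch shows one plausible shape for them. Everything in it is an assumption for illustration: the exact targets, the stage URIs `:stage-II` and `:stage-limited`, and the use of SQLite (with 0/1 for false/true) in place of H2. The point is that the `WHERE` clause of each source query carries our knowledge of how the DB encodes types and stages.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_patient (patientid INT PRIMARY KEY, "
             "name VARCHAR(40), type BOOLEAN, stage TINYINT)")
conn.executemany("INSERT INTO tbl_patient VALUES (?, ?, ?, ?)",
                 [(1, "Mary", 0, 2), (2, "John", 1, 7)])

# (target template, source SQL); hypothetical mappings, only two stage codes shown.
MAPPINGS = [
    (":db1/{PATIENTID} :hasNeoplasm :db1/neoplasm/{PATIENTID} .",
     "SELECT patientid FROM tbl_patient ORDER BY patientid"),
    (":db1/neoplasm/{PATIENTID} a :NSCLC .",
     "SELECT patientid FROM tbl_patient WHERE type = 0"),
    (":db1/neoplasm/{PATIENTID} a :SCLC .",
     "SELECT patientid FROM tbl_patient WHERE type = 1"),
    (":db1/neoplasm/{PATIENTID} :hasStage :stage-II .",
     "SELECT patientid FROM tbl_patient WHERE stage = 2"),
    (":db1/neoplasm/{PATIENTID} :hasStage :stage-limited .",
     "SELECT patientid FROM tbl_patient WHERE stage = 7"),
]


def materialize(conn, mappings):
    """Instantiate every target template once per row of its source query."""
    triples = []
    for target, source in mappings:
        cur = conn.execute(source)
        cols = [d[0].upper() for d in cur.description]
        for values in cur:
            row = dict(zip(cols, values))
            triples.append(
                re.sub(r"\{(\w+)\}", lambda m: str(row[m.group(1)]), target))
    return triples


triples = materialize(conn, MAPPINGS)
for t in triples:
    print(t)
```

Note that the stage is now a constant URI in the target, selected by a filter in the source, rather than a value copied from a column.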

+Querying the new model

In the new model now we can obtain the information of each patient and their condition through URIs of classes or individuals that have clear semantics, not DB dependent. We are using a “global vocabulary”.

This will allow us to easily integrate new DBs and a domain ontology…

+Part 4 Data integration: Integration by alignment

+Data integration in OBDA

Even if two databases contain data about the same domain, integrating the data is often problematic since the data may be represented in different ways

However, if proper modeling is used, integrating multiple data sources using OBDA may become simple:
• Insert the data in the database (either as new tables, or through database federation)
• Create the new mappings for the source such that they match the “global vocabulary”
• Query using the global vocabulary as usual

Consider for example…

+A different Lung Cancer DB

Consider a new lung cancer DB as follows (create it NOW in H2 using the commands in patient-table2.sql)

T_NAME
ID | name    | ssn  | age
1  | Mariano | SSN1 | 33
2  | Morgan  | SSN2 | 45

T_NSCLC
ID | stage
1  | i

T_SCLC
ID | stage
2  | limited

In this DB information is distributed in multiple tables. Moreover, the way in which meaning is encoded is different. In particular:
• The type of cancer is separated by table
• The stage of cancer is text (i, ii, iii, iiia, iiib, iv, limited, extensive)

Moreover, the IDs of the two DBs overlap (ID 1 is a different patient here, not Mary), and ssn and age do not exist in DB1.

+Basic mappings 2

The URIs for the new individuals differentiate the data sources (db2 vs. db1)

Being an instance of NSCLC and SCLC depends now on the table, not a column value

You can find these mappings in lung-cancer3-4tables.obda

+Stage mappings 2

The new mappings reflect what we know of this new data source.

+The integration result

Now, using a single SPARQL query, we can query both data sources independently of their structure; they have been aligned to a global view.
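As a rough picture of what "aligned to a global view" means at the SQL level, the sketch below loads both tutorial databases into one SQLite connection and answers "names of patients that have an NSCLC" with a single union over the two sources. The SQL is hand-written for illustration under the assumption that each branch comes from one source's mappings; it is not the query ontop would produce, and SQLite (0/1 for false/true) stands in for H2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_patient (patientid INT PRIMARY KEY, name VARCHAR(40),
                          type BOOLEAN, stage TINYINT);
INSERT INTO tbl_patient VALUES (1, 'Mary', 0, 2), (2, 'John', 1, 7);

CREATE TABLE t_name (id INT PRIMARY KEY, name VARCHAR(40),
                     ssn VARCHAR(40), age INT);
INSERT INTO t_name VALUES (1, 'Mariano', 'SSN1', 33), (2, 'Morgan', 'SSN2', 45);
CREATE TABLE t_nsclc (id INT PRIMARY KEY, stage VARCHAR(10));
INSERT INTO t_nsclc VALUES (1, 'i');
CREATE TABLE t_sclc (id INT PRIMARY KEY, stage VARCHAR(10));
INSERT INTO t_sclc VALUES (2, 'limited');
""")

# One branch per source; the db1/db2 prefix keeps the overlapping IDs apart.
NSCLC_PATIENTS = """
SELECT ':db1/' || patientid AS patient, name
FROM tbl_patient WHERE type = 0
UNION ALL
SELECT ':db2/' || n.id, n.name
FROM t_name n JOIN t_nsclc c ON c.id = n.id
ORDER BY patient
"""

nsclc_patients = conn.execute(NSCLC_PATIENTS).fetchall()
print(nsclc_patients)
```

In DB1 the cancer type is a column filter, in DB2 it is a join with a type-specific table, yet both branches produce answers in the same global vocabulary.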

+However

Multiple sources may have different properties

We cannot know this beforehand if we don’t know the sources and the only thing we see is the ontology

This can be a BIG issue for the user of our integrating ontology, since many queries would be empty, e.g.:

SELECT ?x ?y ?z WHERE { ?x a :Person ; :name "Mary" ; :ssn ?y ; :age ?z . }

This query is empty because Mary is from DB1 and individuals from DB1 have no SSN or AGE. Similar problems arise with SQL DBs.

But, in SPARQL we have DESCRIBE…

+Flexible queries with SPARQL

“Retrieve all information about individuals named ‘Mary’ and all information about all conditions they have”

PREFIX : <http://example.org/>

DESCRIBE ?x WHERE { { ?x :name "Mary" . } UNION { ?y :name "Mary" ; :hasNeoplasm ?x } }

+Flexible queries with SPARQL

“Retrieve all information about individuals named ‘Mariano’ and all information about all conditions they have”

DESCRIBE ?x WHERE { { ?x :name "Mariano" . } UNION { ?y :name "Mariano" ; :hasNeoplasm ?x } }
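The difference between the fixed-shape SELECT and the DESCRIBE queries can be mimicked over a handful of triples in plain Python. This is a toy model under stated assumptions: the triples are written by hand to resemble what the two sources would yield, and `describe` simply returns every triple whose subject matched, which is one common reading of DESCRIBE (its exact result is left to the implementation).

```python
# Hand-written triples resembling the integrated data (illustrative only).
TRIPLES = [
    (":db1/1", ":name", "Mary"),
    (":db1/1", ":hasNeoplasm", ":db1/neoplasm/1"),
    (":db1/neoplasm/1", "a", ":NSCLC"),
    (":db2/1", ":name", "Mariano"),
    (":db2/1", ":ssn", "SSN1"),
    (":db2/1", ":age", 33),
    (":db2/1", ":hasNeoplasm", ":db2/neoplasm/1"),
    (":db2/neoplasm/1", "a", ":NSCLC"),
]


def objects(subject, predicate):
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def select_name_ssn_age(name):
    """Conjunctive pattern: a row is produced only if :ssn AND :age both exist."""
    rows = []
    for s, p, o in TRIPLES:
        if p == ":name" and o == name:
            for ssn in objects(s, ":ssn"):
                for age in objects(s, ":age"):
                    rows.append((s, ssn, age))
    return rows


def describe(name):
    """All triples about resources with that name and about their neoplasms."""
    targets = set()
    for s, p, o in TRIPLES:
        if p == ":name" and o == name:
            targets.add(s)
            targets.update(objects(s, ":hasNeoplasm"))
    return [t for t in TRIPLES if t[0] in targets]


print(select_name_ssn_age("Mary"))     # empty: DB1 has no ssn or age
print(select_name_ssn_age("Mariano"))
print(describe("Mary"))
```

The describe-style query returns whatever is known about Mary without requiring the user to know in advance which properties her source provides.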

+OBDA for Data Integration

Key features of on-the-fly OBDI (ontology-based data integration) with ontop:
• Flexible: mapping and ontology languages are powerful enough to accommodate most needs (consider that SQL even allows you to transform the data, make calculations, etc.)
• Dynamic: changes in the data are automatically reflected during query answering. Through DB federation, new databases can be incorporated easily

Possible drawbacks:
• Performance: with large volumes of data (hundreds of thousands of rows) performance may suffer (depends on the DB engine, indexes, and other SQL related issues)

+Data integration resources

You may integrate any JDBC resource. Here are some interesting options:

Teiid – can integrate different SQL DBs and other types of documents (XML, Excel, etc.) http://www.jboss.org/teiid/

Oracle database links – integrates Oracle DBs http://docs.oracle.com/cd/B28359_01/server.111/b28310/ds_concepts002.htm

MySQL Federated tables – integrates MySQL DBs http://dev.mysql.com/doc/refman/5.0/en/federated-storage-engine.html

Excel as SQL – integrates Excel spreadsheets http://sourceforge.net/projects/xlsql/


+Part 5 Ontologies and ontop

+Domain knowledge

Up to now, we only have “explicit data”. However, by combining data with domain knowledge (an ontology) we can enrich our queries with “implicit data”.

For example, that NSCLC is a kind of malignant tumor (neoplasm), that having a neoplasm is a kind of condition, etc.

This knowledge can be expressed using OWL axioms, which ontop will use during query answering.

+Lung Cancer knowledge: Terminology Knowledge (TBox)

Plus:

ObjectPropertyRange( :hasStage :LungCancerStage )

+The result

After synchronization, all the implied information is available during query answering


DESCRIBE ?x WHERE { {?x :name "Mariano"} UNION { ?y :name "Mariano" ; :hasNeoplasm ?x }}

+Our data before (only explicit)

Our existing data looked like this picture. With the new axioms, it now looks like this (next slide):

+Our data now (implicit and explicit)

Recall, in the on-the-fly approach all this information is available at query time but not really stored anywhere

+Domain Knowledge

Large amounts of data “belong” in databases (i.e., data that changes fast, is application specific, comes in large volumes, etc.). OBDA allows you to keep it in the DB, but…

Some data in the domain does belong on the ontology side, i.e., static information, independent from the application. For example:

ABox data

+Domain Knowledge

This information is usually given in the form of OWL individual assertions (ABox).

A unique feature of ontop is its ability to mix these two worlds. It allows you to link virtual individuals to real individuals to achieve things like:

ABox data

We want to do this for all individuals in db1 and db2!

+Hybrid ABoxes: How?

Add the individuals ABox assertions to your ontology (6 individuals and 6 ABox assertions)


Add mappings that link your “virtual” individuals to the real ones

+Hybrid ABoxes

PREFIX : <http://example.org/>
DESCRIBE ?x WHERE { ?x :age ?age ; :recordIn [ a :ResearchCenter ; :locatedIn [ :partOf :usa ] ] FILTER (?age > 40) }

+Notes about reasoning in ontop

ontop can only understand OWL 2 QL axioms, that is:
• subClassOf, subPropertyOf, equivalence
• InverseOf
• Domain and Range
• Plus some limited forms of qualified existential restrictions

Any axiom that is not understood by ontop is ignored while reasoning

Reasoning is also done by means of query rewriting (no data moves from the database). Again, most of our research goes into generating efficient SQL.
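A minimal Python sketch of reasoning by rewriting, for subClassOf axioms only: a query for a class is expanded into the union of the SQL sources of all its subclasses, so the inferred answers come straight from the database. The class names `:LungCancer` and `:MalignantNeoplasm`, the helper names and the per-class SQL are assumptions for illustration, and SQLite stands in for H2; ontop's actual rewriting is considerably more elaborate.

```python
import sqlite3

# Hypothetical TBox: sub-class -> super-class.
SUBCLASS_OF = {
    ":NSCLC": ":LungCancer",
    ":SCLC": ":LungCancer",
    ":LungCancer": ":MalignantNeoplasm",
}

# Hypothetical class mappings (only the leaf classes are mapped to data).
CLASS_SOURCES = {
    ":NSCLC": "SELECT 'db1/neoplasm/' || patientid AS x FROM tbl_patient WHERE type = 0",
    ":SCLC": "SELECT 'db1/neoplasm/' || patientid AS x FROM tbl_patient WHERE type = 1",
}


def subclasses_of(cls):
    """Return cls plus every class that is (transitively) a subclass of it."""
    result = {cls}
    changed = True
    while changed:
        changed = False
        for sub, sup in SUBCLASS_OF.items():
            if sup in result and sub not in result:
                result.add(sub)
                changed = True
    return result


def rewrite(cls):
    """SQL for 'SELECT ?x WHERE { ?x a cls }' that takes the subclass axioms into account."""
    parts = [CLASS_SOURCES[c] for c in sorted(subclasses_of(cls)) if c in CLASS_SOURCES]
    return " UNION ".join(parts)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_patient (patientid INT PRIMARY KEY, "
             "name VARCHAR(40), type BOOLEAN, stage TINYINT)")
conn.executemany("INSERT INTO tbl_patient VALUES (?, ?, ?, ?)",
                 [(1, "Mary", 0, 2), (2, "John", 1, 7)])

malignant = sorted(r[0] for r in conn.execute(rewrite(":MalignantNeoplasm")))
nsclc_only = sorted(r[0] for r in conn.execute(rewrite(":NSCLC")))
print(malignant, nsclc_only)
```

No triple stating that a neoplasm is a :MalignantNeoplasm is ever stored; the axioms only change which SQL is sent to the database.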

+

Conclusions: Pointers and Final thoughts

+Other features of ontop

Mapping Assistant (OBDA model tab) - A view to help you generate custom mappings quickly

Mapping bootstrapping (OBDA menu) – automatically generate “direct” mappings (actually the first mappings we created can be generated automatically with this function)

Mapping materialization (OBDA menu) – generate OWL assertions from mappings with one click (import). Try it now, all the data will be available in the “Individuals tab” and you can now use it with any reasoner

SPARQL end-point – Use the mappings and ontology independently from Protégé, as a SPARQL server

+Other features of ontop

OWLAPI and Sesame– Once you created the ontology and mappings, program your application with ontop and Java using these.

Command line tools – All previous features can be used directly from the command line with ontop scripts

R2RML mappings – ontop now also supports R2RML mappings (http://www.w3.org/TR/r2rml/). The expressive power of the two mapping languages is more or less the same; however, our syntax is more user friendly ☺. Use R2RML for mapping exchange.

JDBC sources – ontop can support any JDBC data source. This means not only RDBMSs, but anything that can be seen as an RDBMS and queried with SQL. Currently there are many wrappers that allow you to do this for Excel files, XML documents, etc.


+Disclaimer

Although the code of ontop is evolving fast, there are several important issues (as of Sept/13) to consider when using ontop:

Datatypes: many data types are not supported yet; issues with dateTime

SPARQL: the current target is SPARQL 1.0 plus most features of 1.1 (no paths). From SPARQL 1.0 we still miss several built-in functions

SQL issues: some issues with SQL and some DBs, e.g., problems getting DB metadata, issues with caps and quotes to qualify column names

SQL Optimization: performance is good, but could be better. Many planned optimizations are not yet implemented.

GUI/ontopPro: many bugs in the GUI (we focus on the DB aspect)

+Additional material

Ontop’s website http://ontop.inf.unibz.it

Ontop’s documentation https://babbage.inf.unibz.it/trac/obdapublic/wiki

Ontop’s source code https://github.com/ontop/ontop/

Since August’13 ontop is open source (AGPL). Consider contributing!

Ontop’s google group https://groups.google.com/d/forum/ontop4obda


+

Thank you
