Web Science - ISoLA 2012

Using OWL Domain Models as Abstract Workflow Models

Or...Conducting in silico research in the Web

from hypothesis to publication

Mark Wilkinson

Isaac Peral Senior Researcher in Biological Informatics, Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain

Adjunct Professor of Medical Genetics, University of British Columbia, Vancouver, BC, Canada.

Context

“While it took 2,300 years after the first report of angina for the condition to be commonly taught in medical curricula, modern discoveries are being disseminated at an increasingly rapid pace. Focusing on the last 150 years, the trend still appears to be linear, approaching the axis around 2025.”

The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al, The Fourth Paradigm: Data-Intensive Scientific Discovery Tony Hey (Editor), 2009

Slide adapted with permission from Joanne Luciano, Presentation at Health Web Science Workshop 2012, Evanston IL, USA, June 22, 2012.


“The Singularity”

The X-intercept is where, the moment a discovery is made, it is immediately put into practice

(not only medical practice, but any research endeavour...)




The technology required to achieve this

does not yet exist

Scientific research would have to be conducted within a medium that

immediately interpreted and disseminated the results...

You Are

Here

...in a form that immediately (actively!) affected the research of others...

You Are

Here

...without requiring them to be aware of these new discoveries.

You Are

Here

To achieve this vision

We must learn how to do research IN the Web

Not OVER the Web

How we use the Web today

I’d like to show you how close we now are to this vision

and how we got there

Web Science 2.0

We wanted to duplicate a real, peer-reviewed bioinformatics analysis

simply by building a model in the Web describing what the answer

(if one existed)

would look like

...the machine had to make every other decision

on its own

This is the study we chose:

Gordon, P.M.K., Soliman, M.A., Bose, P., Trinh, Q., Sensen, C.W., Riabowol, K.: Interspecies data mining to predict novel ING-protein interactions in human. BMC Genomics. 9, 426 (2008).

Original Study Simplified

Using what is known about interactions in fly & yeast

predict new interactions with your human protein of interest

Given a protein P in Species X

Find proteins similar to P in Species Y

Retrieve interactors in Species Y

Sequence-compare Y-interactors with Species X genome

(1) Keep only those with homologue in X

Find proteins similar to P in Species Z

Retrieve interactors in Species Z

Sequence-compare Z-interactors with (1)

Putative interactors in Species X
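
The steps above can be sketched in Python. This is only an illustration of the abstract workflow, not the pipeline used in the original study: the small in-memory tables (SIMILAR, INTERACTS, HOMOLOGUE_IN_X) and every protein name are invented stand-ins for sequence-similarity searches and interaction databases, and the comparison of Z-interactors with (1) is simplified to a set intersection over Species X homologues.

```python
# Toy stand-ins for sequence-similarity searches and interaction databases.
# Every identifier below is hypothetical and exists only for illustration.

# Proteins similar to P, keyed by model species.
SIMILAR = {
    "Y": {"yP"},
    "Z": {"zP"},
}

# Known interaction partners within each model species.
INTERACTS = {
    "yP": {"yA", "yB", "yC"},
    "zP": {"zA", "zC", "zD"},
}

# Homologue in Species X of a model-organism protein (absent = no homologue).
HOMOLOGUE_IN_X = {
    "yA": "xA", "yB": "xB",
    "zA": "xA", "zC": "xC", "zD": "xD",
}


def candidates_from(species):
    """Species-X homologues of the interactors of P-like proteins in `species`."""
    found = set()
    for similar_protein in SIMILAR.get(species, set()):
        for interactor in INTERACTS.get(similar_protein, set()):
            homologue = HOMOLOGUE_IN_X.get(interactor)
            if homologue is not None:  # keep only those with a homologue in X
                found.add(homologue)
    return found


def putative_interactors(species_y, species_z):
    """Keep only candidates supported by both model organisms."""
    kept = candidates_from(species_y)           # step (1)
    return kept & candidates_from(species_z)    # compare Z-interactors with (1)


result = putative_interactors("Y", "Z")
print(sorted(result))
```

In this toy data only one candidate is supported by both model species, which is the behaviour the diagram describes: evidence from a single organism is not enough.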

Abstracted

Modeling the answer...

OWL

Web Ontology Language (OWL) is the language approved by the W3C

for representing knowledge in the Web


Note that every word in this diagram is, in reality, a URL (because it is OWL)


The model of a Potential Interactor is published in The Web

It utilizes concepts from other models published in The Web (ours and others’) by referencing their URLs


The model of a Potential Interactor is a network of concepts distributed within the Web

It will be affected by changes to those concepts

We do not “own” all of those concepts!

Probable Interactor is homologous to (Potential Interactor from ModelOrganism1…)

and

(Potential Interactor from ModelOrganism2…)

Probable Interactor is defined in OWL as a subclass of Potential Interactor that requires homologous pairs of interacting proteins to exist in both

comparator model organisms.

(Effectively, an intersection)


Publish our OWL model of a Probable Interactor

in the Web

In a local data-file

provide the protein we are interested in

and the two species we wish to use in our comparison

taxon:9606 a i:OrganismOfInterest . # human
uniprot:Q9UK53 a i:ProteinOfInterest . # ING1
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly

Running a Web Science 2.0 Experiment

The tricky bit is...

In the abstract, the search for homology is “generic” – ANY model organism.

But when the machine attempts to do the experiment, it will have to use several different and specific resources because our question specifies two different species

taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly

PREFIX i: <http://sadiframework.org/ontologies/InteractingProteins.owl#>

SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE {
  ?protein a i:ProbableInteractor .
}

This is the question we ask (the query language here is SPARQL)

The reference (URL) to our OWL model of the answer

Our system then derives (and executes) the following workflow automatically

These are different Web services!

...selected at run-time based on the same model

There are three very cool things about what you just saw...


The system was able to create a workflow based on an OWL model (ontology)


The system was able to create a COMPUTATIONAL workflow

based on a BIOLOGICAL model


The workflow it created (i.e. the services chosen)

differed depending on context

taxon:4932 a i:ModelOrganism1 . # yeast

taxon:7227 a i:ModelOrganism2 . # fly
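
A minimal sketch of how one abstract model step could resolve to different concrete services depending on the species in the input data. The registry, the step name findInteractors, and both service names are hypothetical; the talk does not show how the real system stores or looks up its services.

```python
# Hypothetical registry: one abstract step, different concrete services per taxon.
# The step name and the service names are invented for illustration.
SERVICES = {
    ("findInteractors", "taxon:4932"): "yeast_interaction_service",
    ("findInteractors", "taxon:7227"): "fly_interaction_service",
}


def plan(abstract_steps, taxon):
    """Pick a concrete service for each abstract step, given the species."""
    return [SERVICES[(step, taxon)] for step in abstract_steps]


model = ["findInteractors"]             # the same model in both cases
yeast_plan = plan(model, "taxon:4932")  # context: ModelOrganism1
fly_plan = plan(model, "taxon:7227")    # context: ModelOrganism2

print(yeast_plan)
print(fly_plan)
```

The model list never changes; only the taxon supplied in the local data file changes which service is chosen.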

We got the answer

“simply” by designing a model of the answer!

How did we do that?

Design Pattern for Web Services on the Semantic Web

A Web application that answers SPARQL-DL queries

Query-answering Enhanced by SADI

Demos of SADI and SHARE

What is the phenotype of every allele of the Antirrhinum majus DEFICIENS gene?

SELECT ?allele ?image ?desc
WHERE {
  locus:DEF genetics:hasVariant ?allele .
  ?allele info:visualizedByImage ?image .
  ?image info:hasDescription ?desc
}

Note that there is no “FROM” clause! We don’t tell it where it should get the information. The machine has to figure that out by itself...

Enter that query into SHARE

Click “Submit”...

SHARE examines available SADI Web Services...and in a few seconds you get your answer.

The query results are live hyperlinks to the respective databases or images

(the answer is IN the Web!)
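
One way to picture how a query with no FROM clause can still be answered is a lookup keyed on the predicate of each triple pattern. The sketch below is a simplification and not SHARE's actual implementation: the REGISTRY dictionary, the two toy service functions, and the allele and image identifiers are all hypothetical.

```python
# Hypothetical sketch: answer a chain of triple patterns by finding, for each
# predicate, a service that can generate it. Not the real SADI/SHARE API.

def variant_service(subject):
    # Toy data: alleles of a locus (identifiers invented).
    return {"locus:DEF": ["allele:def-1", "allele:def-2"]}.get(subject, [])


def image_service(subject):
    # Toy data: images attached to an allele (identifiers invented).
    return {"allele:def-1": ["img:1"], "allele:def-2": ["img:2"]}.get(subject, [])


# Registry: predicate -> service able to attach that property to its input.
REGISTRY = {
    "genetics:hasVariant": variant_service,
    "info:visualizedByImage": image_service,
}


def answer(start, predicates):
    """Follow a chain of predicates from `start`, returning rows of bindings."""
    rows = [[start]]
    for predicate in predicates:
        service = REGISTRY[predicate]  # chosen from the query, not a FROM clause
        rows = [row + [value] for row in rows for value in service(row[-1])]
    return rows


rows = answer("locus:DEF", ["genetics:hasVariant", "info:visualizedByImage"])
for row in rows:
    print(row)
```

The query author names only the properties wanted; which source supplies each one is decided by the lookup.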

What pathways does UniProt protein P47989 belong to?

PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX ont: <http://ontology.dumontierlab.com/>
PREFIX uniprot: <http://lsrn.org/UniProt:>

SELECT ?gene ?pathway
WHERE {
  uniprot:P47989 pred:isEncodedBy ?gene .
  ?gene ont:isParticipantIn ?pathway .
}

Note again that there is no “FROM” clause…

I have not told SHARE where to look for the answer; I am simply asking my question


Two different providers of gene information (KEGG & NCBI) were found & accessed

Two different providers of pathway information (KEGG and GO) were found & accessed

The results are all links to the original data (The answer is IN the Web!)

Show me the latest Blood Urea Nitrogen and Creatinine levels of patients who appear to be rejecting their transplants

(I showed you this query in ISoLA 2010… sorry for repeating myself)

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>

SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
  ?patient rdf:type patient:LikelyRejecter .
  ?patient l:latestBUN ?bun .
  ?patient l:latestCreatinine ?creat .
}

Likely Rejecter:

A patient who has creatinine levels that are increasing over time

- Mark D Wilkinson’s definition

Likely Rejecter:

…but there is no “likely rejecter” column or table in our database…

only blood chemistry measurementsat various time-points

Likely Rejecter:

So the data required to answer this question DOESN’T EXIST!

?


SHARE “decomposes” the Likely Rejecter OWL class

into its constituent property restrictions

Each property restriction in the Class is matched with a SADI Service

The matched SADI Service can generate data that has that property

SHARE chains these SADI services into a workflow...

...the outputs from that workflow are Instances (OWL Individuals) of the Likely Rejecter OWL Class

For example… SHARE utilizes SADI to discover analytical services on the Web that do linear regression analysis;

required for the “increasing over time” part of the Class definition
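
As an illustration of the “increasing over time” test, here is a least-squares slope computed in plain Python. The measurement values are invented, and treating any positive slope as “increasing” is an assumption made for this sketch; the talk does not say what threshold or regression service is actually used.

```python
# Least-squares slope of creatinine over time. A positive slope is used here
# as a stand-in for "increasing over time" (an assumption, not the source's rule).
# All measurement values are invented.

def slope(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    denominator = sum((t - mean_t) ** 2 for t in times)
    return numerator / denominator


def is_likely_rejecter(measurements):
    """measurements: list of (time_point, creatinine) pairs."""
    times = [t for t, _ in measurements]
    values = [v for _, v in measurements]
    return slope(times, values) > 0


rising = [(0, 1.0), (1, 2.0), (2, 3.0)]
falling = [(0, 3.0), (1, 2.0), (2, 1.0)]

print(is_likely_rejecter(rising))
print(is_likely_rejecter(falling))
```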

VOILA!

SHARE examines the OWL Class

Gathers, from the Web, the ontologies that are referenced by that Class

then uses those ontological properties to identify which data-sources and analytical

tools it must access to create data matching that Class definition
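
The decompose, match, and chain idea can be sketched as follows. This is a toy model under stated assumptions, not SHARE's code: the property names (hasCreatinineSeries, hasTrend), the two service functions, the registry, and the patient data are all invented, and the trend calculation is a crude stand-in for a real regression service.

```python
# Hypothetical sketch: decompose a class into property restrictions, match each
# restriction to a service that can generate that property, and chain them.
# All property names, services, and patient data are invented for illustration.

TOY_DB = {
    "patient:1": [1.0, 1.4, 1.9],  # creatinine rising
    "patient:2": [1.5, 1.2, 1.0],  # creatinine falling
}


def fetch_series(record):
    record["hasCreatinineSeries"] = TOY_DB[record["id"]]
    return record


def estimate_trend(record):
    ys = record["hasCreatinineSeries"]
    # Crude stand-in for a regression service: average change per time-point.
    record["hasTrend"] = (ys[-1] - ys[0]) / (len(ys) - 1)
    return record


# Registry: property -> service able to attach that property to its input.
REGISTRY = {
    "hasCreatinineSeries": fetch_series,
    "hasTrend": estimate_trend,
}

# The class, decomposed into the properties its members must have (in order).
LIKELY_REJECTER = ["hasCreatinineSeries", "hasTrend"]


def build_workflow(restrictions):
    """Match each property restriction with a service that generates it."""
    return [REGISTRY[prop] for prop in restrictions]


def instances_of(restrictions, ids, membership_test):
    """Run the chained services and keep the inputs that satisfy the class."""
    workflow = build_workflow(restrictions)
    members = []
    for identifier in ids:
        record = {"id": identifier}
        for service in workflow:
            record = service(record)
        if membership_test(record):
            members.append(identifier)
    return members


likely_rejecters = instances_of(
    LIKELY_REJECTER, sorted(TOY_DB), lambda r: r["hasTrend"] > 0
)
print(likely_rejecters)
```

Nothing in the toy database says who is a likely rejecter; membership is produced by running the chained services, which mirrors the point that the required data did not exist beforehand.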

OWL

The way SHARE builds the workflow varies depending on the context of the query

(i.e. which data/ontologies it reads – Mine? Yours?)

and on what part of the query it is trying to answer at any given moment

(which ontological concept is relevant to that clause)

And that brings us back to...

Web Science 2.0

Gordon, P.M.K., Soliman, M.A., Bose, P., Trinh, Q., Sensen, C.W., Riabowol, K.: Interspecies data mining to predict novel ING-protein interactions in human. BMC Genomics. 9, 426 (2008).

derives and executes the following workflow automatically using an OWL ontology that describes the biology

The analytical tools chosen for that workflow were determined based on

context

even though the biological (ontological) model driving their selection was the

same

i.e.

The published model is re-usable

In different contexts... by different researchers

Because the model IS the experiment

the published EXPERIMENT is re-usable!!

Simply point the same query at your own dataset...

The

scientific publication

is an

executable document!

Every component of the model

Every component of the input data

Every component of the output data

is a URL

Therefore the model, the question, the experiment, and the results

are inherently IN the Web

The answer, and the knowledge derived from it, is immediately available to Web search engines

and moreover, can instantly affect the outcome of other Web Science experiments

You Are Now Here!!!

Change the way we think of “hypotheses”

In Web Science 2.0

Model what the world would “look like” if your hypothesis were true

Then ask “is there any data that fits that model?”

Please join us!

SADI and SHARE are Open-Source projects

http://sadiframework.org

My New Home!

Luke McCarthy – Lead Dev. Everything...

Benjamin VanderValk SHARE & SADI & Experimental modeling & myHeath Button

Soroush Samadian Cardiovascular data modeling and queries

University of British Columbia

Edward Kawas SADI Service auto-generator

Ian Wood Experimental modeling project

U of New Brunswick

Dr. Chris Baker, Alexandre Riazanov

Carleton University

Dr. Michel Dumontier, Marc-Alexandre Nolin, Leonid Chepelev, Steve Etlinger, Nichaella Kieth, Jose Cruz

C-BRASS Collaborators at other sites

Microsoft Research

Technology
